The aim of this repository is to collect all the publicly available software estimation datasets that also include the corresponding actual implementation effort.
Entries are listed in order of the number of rows in the dataset, largest first.
Note: it's common for paper X to cite Y as the source of the data, without checking that Y in turn cites Z as the source (occasionally the citation chain is longer).
Files in the arff format embed information about the data in a header (a relation name, plus the name and type of each attribute) that precedes the data rows.
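As a rough illustration of that embedded information, here is a minimal Python sketch that pulls the relation name and attribute declarations out of an arff header using only the standard library. The function name `read_arff_header` and the sample text are made up for this example (the sample is not taken from any file in this repository), and the parser is deliberately simplified: it assumes attribute names contain no spaces.

```python
import io


def read_arff_header(lines):
    """Return (relation_name, [(attribute_name, attribute_type), ...])."""
    relation = None
    attributes = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and '%' comment lines.
        if not line or line.startswith("%"):
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip().strip("'\"")
        elif keyword == "@attribute":
            # Simplification: assumes attribute names contain no spaces.
            _, name, attr_type = line.split(None, 2)
            attributes.append((name.strip("'\""), attr_type.strip()))
        elif keyword == "@data":
            break  # The header ends where the data rows begin.
    return relation, attributes


# Made-up sample for illustration; not a file from this repository.
SAMPLE = """\
% Illustrative example only
@relation example_projects
@attribute function_points numeric
@attribute effort_hours numeric
@attribute language {java,cobol}
@data
120,950,java
300,2100,cobol
"""

relation, attributes = read_arff_header(io.StringIO(SAMPLE))
print(relation)
for name, attr_type in attributes:
    print(name, attr_type)
```

For real analysis a full parser is likely a better choice; for example, SciPy provides `scipy.io.arff.loadarff`, which returns both the data and the header metadata.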
If you know of any datasets that are missing, and you have the data, please send me a copy. Also, if you spot any mistakes, please let me know.
Valdes-Souto.csv
F. Valdés-Souto and J. Valeriano-Assem, "Merging Distinct Sources Databases to Improve Software Estimation Models", Programming and Computer Software, 2024, Vol. 50(8), pp. 786–795.
57 rows
Project totals of COSMIC function points and person hours. Data from a Mexican company, the International Software Benchmarking Standards Group (ISBSG), and the Mexican Software Metrics Association (AMMS).
CESAW.tgz
Derek M. Jones, William R. Nichols, "The CESAW dataset: a conversation", Jun 2021, arXiv:cs.SE/2106.03679.
203,621 rows
Estimate/actual in person hours for small tasks.
renzo-pomodoro.csv
Derek M. Jones, "The Renzo Pomodoro dataset", Dec 2019, The Shape of Code.
17,764 rows
Estimate/actual in Pomodoros for small tasks.
Subbiah.csv
C. Subbiah, "Task-Based Estimation and Planning for Application Development Projects and Resources: Models, Methods and Applications", PhD thesis Dec 2019, University of Missouri – St. Louis.
72 rows
Function points.
Derek M. Jones, Stephen Cullum, "A conversation around the analysis of the SiP effort estimation dataset", Jan 2019, arXiv:cs.SE/1901.01621.
10,100 rows
Estimate/actual in person hours for small tasks.
Derek M. Jones, "Small team estimating in story points; a project dataset", Feb 2023, blog post.
630 rows
Estimates in Story points and actuals in hours for small tasks.
china.arff
Fang Hon Yun, "China: Effort Estimation Dataset", Apr 2010, Zenodo.
499 rows
Estimate in Function points and actual in person hours for large tasks (i.e., years).
Huijgens492.zip
Hennie Huijgens, Arie van Deursen, Rini van Solingen, "The effects of perceived value and stakeholder satisfaction on software project impact", Sep 2017, Information and Software Technology volume 89, pp 19-36.
492 rows
Estimates in Function points and actuals in Euros/person hours for medium size projects.
kitchenham.arff
Barbara Kitchenham, Shari Lawrence Pfleeger, Beth McColl and Suzanne Eagan, "An empirical study of maintenance and development estimation accuracy", Oct 2002, Journal of Systems and Software, volume 64(1), pp. 57-77.
145 rows
Estimates in Function points/hours and actuals in person hours for large projects.
nasa93.arff
T. Menzies, D. Port, Z. Chen, J. Hihn and S. Stukes, "Validation Methods for Calibrating Software Effort Models", May 2005, 27th International Conference on Software Engineering, pp 587-595.
93 rows
Estimates in thousands of lines of code and actuals in person months for large projects.
Desharnais.csv
Jean-Marc Desharnais, "Analyse Statistique de la Productivite des Projets de Developpement en Informatique a Partir de la Technique des Points de Fonction" (Statistical Analysis on the Productivity of Data Processing with Development Projects using the Function Point Technique), Dec 1988, MSc, Université du Québec à Montréal.
80 rows
Estimate in Function points and actual in person hours for large tasks (i.e., years).
UCP_Dataset.csv
Radek Silhavy, Petr Silhavy and Zdenka Prokopova, "Analysis and selection of a regression model for the Use Case Points method using a stepwise approach", Mar 2017, The Journal of Systems and Software, volume 125(C), pp 1-14.
71 rows
Estimates in Use case points and actuals in person hours for large projects.
COCOMO-81.csv
Barry W. Boehm, "Software Engineering Economics", 1981, Prentice-Hall, Inc.
63 rows
Estimate in lines of code and actual in person months for large projects.
Maxwell.arff
K. D. Maxwell, "Applied Statistics for Software Managers", Prentice-Hall, 2002.
62 rows
Estimates in Function points and actuals in person hours for large projects.
miyazaki94.csv
Y. Miyazaki, M. Terakado, K. Ozaki, H. Nozaki, "Robust regression for developing software estimation models", Oct 1994, Journal of Systems and Software, volume 27(1), pp. 3-16.
48 rows
Estimates in lines of code and actuals in man-months for large projects.
Finnish.arff
B. Sigweni and M. Shepperd, "Using Blind Analysis for Software Engineering Experiments", 2015, Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, pp. 1-6.
38 rows
Estimates in Function points and actuals in person months for large projects.
Albrecht.arff
A.J. Albrecht and J.E. Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation", Nov 1983, IEEE Transactions on Software Engineering volume SE-9(6), pp 639-648.
24 rows
Estimates in Function points and actuals in thousands of person hours for large projects.