On the Distribution of Source Code File Sizes

Source code size is an estimator of software effort, and it is often used to calibrate the models and equations that estimate the cost of software. The distribution of source code file sizes has been reported in the literature to be lognormal. In this paper, we measure the size of a large collection of software (the Debian GNU/Linux distribution, version 5.0.2) and find that the statistical distribution of its source code file sizes follows a double Pareto distribution. Large files therefore occur more often than the lognormal distribution predicts, which implies that previously proposed models underestimate the cost of software.
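
The tail argument can be illustrated with a minimal sketch that compares the probability of exceeding a given size under the two distributions. It is not the analysis performed in the paper. It assumes the standard double Pareto form, whose density is proportional to x^(beta-1) for 0 < x <= 1 and to x^(-alpha-1) for x > 1, with sizes expressed in units of the crossover point. The function names and the parameter values (mu, sigma, alpha, beta) are hypothetical and chosen only for illustration; they are not fitted to the Debian data.

```python
import math


def lognormal_sf(x, mu=0.0, sigma=1.0):
    """P(X > x) for a lognormal whose logarithm has mean mu and std sigma."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def double_pareto_sf(x, alpha=1.5, beta=1.5):
    """P(X > x) for a double Pareto with crossover at 1 (valid for x >= 1).

    Integrating the upper branch of the density,
    alpha*beta/(alpha+beta) * t**(-alpha-1), from x to infinity gives
    beta/(alpha+beta) * x**(-alpha).
    """
    if x < 1.0:
        raise ValueError("this sketch only covers the upper tail, x >= 1")
    return beta / (alpha + beta) * x ** (-alpha)


# Illustrative parameters only (assumptions, not estimates from the paper).
for size in (10.0, 100.0, 1000.0, 10000.0):
    ln_tail = lognormal_sf(size)
    dp_tail = double_pareto_sf(size)
    print(f"x={size:>8.0f}  lognormal={ln_tail:.3e}  double Pareto={dp_tail:.3e}")
```

With these parameters both distributions place half of their mass above x = 1, yet the power-law tail decays far more slowly. The qualitative outcome does not depend on the chosen values: the lognormal tail falls off faster than any power of x, so for sufficiently large sizes a double Pareto tail always assigns more probability to very large files.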
