A Model of the Commit Size Distribution of Open Source

A fundamental unit of work in programming is the code contribution ("commit") that a developer makes to the code base of the project they are working on. We use statistical methods to derive a model of the probability distribution of commit sizes in open source projects, and we show that the model applies to projects of different sizes. We use both graphical and statistical methods to validate the model's goodness of fit. By measuring and modeling a fundamental dimension of programming, we help improve software development tools and our understanding of software development.
