Are 20% of files responsible for 80% of defects?

Background: Over the past two decades, a mixture of industry anecdote and empirical academic studies has suggested that the 80:20 rule (otherwise known as the Pareto Principle) applies to the relationship between source code files and the number of defects in a system: a small minority of files (roughly 20%) is responsible for a majority of defects (roughly 80%). Aims: This paper aims to establish how widespread the phenomenon is by analysing 100 systems (previous studies have focussed on between one and three systems), with the goal of determining whether and under what circumstances this relationship holds, and whether the key files can be readily identified from basic metrics. Method: We devised a search criterion to identify defect fixes from commit messages and used it to analyse 100 active GitHub repositories spanning a variety of languages and domains. We then studied the relationship between files, basic metrics (churn and LOC), and defect fixes. Results: We found that the Pareto principle does hold, but only if defects that incur fixes to multiple files count as multiple defects. When we investigated multi-file fixes, we found that key files (those in the top 20%) are commonly fixed alongside other, much less frequently fixed files. We found LOC to be poorly correlated with defect proneness. Code churn was a more reliable indicator, but only for extremely high values of churn. Conclusions: It is difficult to reliably identify the "most fixed" 20% of files from basic metrics. However, even if they could be reliably predicted, focussing on them would probably be misguided: although fixes naturally involve files that are often involved in other fixes, they also tend to include other, less frequently fixed files.
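
The counting question raised in the Results can be illustrated with a minimal sketch. It is not the paper's tooling: the function name `top_file_share`, the `split_multi_file` flag, and the toy data are hypothetical. It assumes the fix commits have already been identified and are given as lists of touched file paths. The alternative weighting, where a fix touching k files gives each file 1/k of a defect, is one possible way of not counting a multi-file fix as multiple defects and is an assumption here. Files that never appear in a fix are not counted, which would understate the concentration in a real repository.

```python
from collections import Counter
from math import ceil


def top_file_share(fix_commits, fraction=0.2, split_multi_file=False):
    """Share of defect weight carried by the most-fixed `fraction` of files.

    fix_commits: iterable of lists of file paths, one list per fix commit.
    split_multi_file=False: a fix touching k files counts as k defects,
        one for each file it touches.
    split_multi_file=True: each fix counts as one defect, split evenly
        (1/k each) across the k files it touches. This weighting is an
        assumption made for illustration.
    """
    weights = Counter()
    for files in fix_commits:
        unique = set(files)  # ignore a path listed twice in one commit
        if not unique:
            continue
        weight = 1.0 / len(unique) if split_multi_file else 1.0
        for path in unique:
            weights[path] += weight

    if not weights:
        return 0.0

    ranked = sorted(weights.values(), reverse=True)
    top_n = max(1, ceil(fraction * len(ranked)))  # always keep one file
    return sum(ranked[:top_n]) / sum(ranked)


# Toy data (hypothetical): a.py is fixed alongside four other files.
fixes = [
    ["a.py", "b.py"],
    ["a.py", "c.py"],
    ["a.py", "d.py"],
    ["a.py", "e.py"],
    ["b.py"],
]

print(top_file_share(fixes))
print(top_file_share(fixes, split_multi_file=True))
```

In this toy data the top file's share is larger when each touched file counts as a separate defect than when each fix is split across its files, which is the direction of the effect the Results describe.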
