An Empirical Study of Different Types of Changes in the Eclipse Project

This paper studies the distribution of different types of changes in various contexts of a system, and the relationship between artifact (file and module) size and those changes. We used the change data of the open-source Eclipse Project across its decade-long evolution history. The latest release has 220 modules, 33,904 files, 3,780,201 lines of code, and 49,853 changes (cumulatively). The study focused on two levels of software artifacts, module and file, and on four contexts of changes: all changes, error changes, non-error changes, and 19 change categories. We found that the power-law distribution was common to three contexts of changes at both the module and file levels: it existed in all changes, in error changes, and in non-error changes. For the 19 change categories, files and modules behaved differently: the power-law distribution existed in all but one category at the module level, whereas about two thirds of the change categories did not show it at the file level. On the relationship between artifact size and changes, we found that, at the module level, the few modules that accounted for the majority of changes also accounted for the majority of the code size; however, this phenomenon disappeared when we separated error changes from non-error changes. At the file level, this phenomenon did not exist at all. We did not find any correlation between artifact size and changes at either the module or file level.
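
As a rough illustration of the two kinds of checks described above, the following sketch computes how concentrated changes are among the most-changed artifacts and estimates a power-law exponent from per-artifact change counts. It is not the procedure used in the study: the function names `top_share` and `powerlaw_alpha`, the sample counts, and the choice of the continuous maximum-likelihood estimator with a fixed `xmin` are all assumptions made for illustration. A real analysis would also need a goodness-of-fit test before claiming that a power law holds.

```python
import math


def top_share(counts, fraction=0.2):
    """Share of all changes held by the most-changed `fraction` of artifacts."""
    ordered = sorted(counts, reverse=True)
    k = max(1, math.ceil(len(ordered) * fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


def powerlaw_alpha(counts, xmin=1):
    """Continuous maximum-likelihood estimate of the power-law exponent.

    Uses alpha = 1 + n / sum(ln(x_i / xmin)) over the values x_i >= xmin.
    The cutoff xmin is taken as given here (an assumption); in practice it
    is usually estimated from the data as well.
    """
    tail = [c for c in counts if c >= xmin]
    log_sum = sum(math.log(c / xmin) for c in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly greater than xmin")
    return 1 + len(tail) / log_sum


# Hypothetical change counts per module (made-up numbers, not Eclipse data).
changes_per_module = [120, 45, 30, 12, 8, 5, 3, 2, 1, 1]

print(f"top 20% of modules hold {top_share(changes_per_module):.1%} of changes")
print(f"estimated exponent: {powerlaw_alpha(changes_per_module):.2f}")
```

Running the same two functions on per-file counts and on per-category subsets would mirror the module-versus-file comparison in the text, under the assumptions stated above.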
