Finding Defective Modules from Highly Unbalanced Datasets

Many software engineering datasets are highly unbalanced, i.e., the number of instances of a one class outnumber the number of instances of the other class. In this work, we analyse two balancing techniques with two common classification algorithms using five open public datasets from the PROMISE repository in order to find defective modules. The results show that although balancing techniques may not improve the percentage of correctly classified instances, they do improve the AUC measure, i.e., they classify better those instances that belong to the minority class from the minority class.

[1]  Taghi M. Khoshgoftaar,et al.  Application of neural networks to software quality modeling of a very large telecommunications system , 1997, IEEE Trans. Neural Networks.

[2]  Thomas J. Ostrand,et al.  \{PROMISE\} Repository of empirical software engineering data , 2007 .

[3]  Taghi M. Khoshgoftaar,et al.  An Empirical Study of the Classification Performance of Learners on Imbalanced and Noisy Software Quality Data , 2007, 2007 IEEE International Conference on Information Reuse and Integration.

[4]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[5]  Maurice H. Halstead,et al.  Elements of software science , 1977 .

[6]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[7]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[8]  I. Tomek,et al.  Two Modifications of CNN , 1976 .

[9]  Victor R. Basili,et al.  A Validation of Object-Oriented Design Metrics as Quality Indicators , 1996, IEEE Trans. Software Eng..

[10]  Anas N. Al-Rabadi,et al.  A comparison of modified reconstructability analysis and Ashenhurst‐Curtis decomposition of Boolean functions , 2004 .

[11]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[12]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[13]  Jorma Laurikkala,et al.  Improving Identification of Difficult Small Classes by Balancing Class Distribution , 2001, AIME.

[14]  Zhan Li,et al.  A practical method for the software fault-prediction , 2007, 2007 IEEE International Conference on Information Reuse and Integration.

[15]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[16]  Dennis L. Wilson,et al.  Asymptotic Properties of Nearest Neighbor Rules Using Edited Data , 1972, IEEE Trans. Syst. Man Cybern..

[17]  Akito Monden,et al.  The Effects of Over and Under Sampling on Fault-prone Module Detection , 2007, First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007).

[18]  Norman E. Fenton,et al.  Software Measurement: Uncertainty and Causal Modeling , 2002, IEEE Softw..