Applying Novel Resampling Strategies To Software Defect Prediction

Due to the tremendous complexity and sophistication of software, improving software reliability is an enormously difficult task. We study the software defect prediction problem, which focuses on predicting which modules will experience a failure during operation. Numerous studies have applied machine learning to software defect prediction; however, the skewed class distribution of defect-prediction datasets usually undermines the learning algorithms, and the resulting classifiers often fail to predict the faulty minority class at all. This problem is well known in machine learning and is often referred to as learning from unbalanced datasets. We examine stratification, a widely used technique for learning from unbalanced data that has received little attention in software defect prediction. Our experiments focus on SMOTE (Synthetic Minority Over-sampling Technique), a method of over-sampling minority-class examples. Our goal is to determine whether SMOTE can improve recognition of defect-prone modules, and at what cost. Our experiments demonstrate that SMOTE resampling yields a more balanced classification. We found an improvement of at least 23% in the average geometric-mean classification accuracy on four benchmark datasets.
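
To make the two technical ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the usual description of SMOTE (each synthetic example is placed at a random point on the line segment between a minority example and one of its k nearest minority-class neighbours) and assumes that "geometric mean classification accuracy" means the square root of the product of the per-class accuracies. The function names `smote` and `g_mean` and the parameters `n_per_example`, `k`, and `seed` are hypothetical names chosen for illustration.

```python
import math
import random


def smote(minority, n_per_example=1, k=5, seed=0):
    """Return synthetic minority examples (simplified SMOTE sketch).

    minority: list of equal-length numeric feature vectors.
    n_per_example: synthetic points generated per original example (assumed name).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for base in minority:
        # k nearest minority-class neighbours of the base example, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        for _ in range(n_per_example):
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from base to mate
            synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of the accuracy on the positive and negative classes."""
    pairs = list(zip(y_true, y_pred))
    pos = sum(1 for t, _ in pairs if t == positive)
    neg = len(pairs) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return math.sqrt((tp / pos) * (tn / neg))


if __name__ == "__main__":
    # Toy data: three hypothetical defect-prone modules with two metrics each.
    faulty = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    print(len(smote(faulty, n_per_example=2, k=2)))

    # A classifier that never predicts the faulty class is 90% accurate here,
    # yet its geometric mean is zero.
    y_true = [0] * 9 + [1]
    print(g_mean(y_true, [0] * 10))
```

The second call shows why a measure of this kind suits skewed data: a classifier that ignores the minority class scores zero regardless of how well it does on the majority class.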
