Stability of Software Defect Prediction in Relation to Levels of Data Imbalance

Software defect prediction is recognized as one of the most important means of improving software development efficiency. The majority of software development costs are spent on defect detection activities, yet their ability to guarantee software reliability remains limited. Analyses performed on large-scale industrial software with a high focus on reliability [Andersson and Runeson 2007; Fenton and Ohlsson 2000; Galinac Grbac et al. 2013] show that faults are distributed within a system according to the Pareto principle: the majority of faults are concentrated in a small number of system modules, and these modules do not account for a majority of the system's size. This implies that software defect prediction could bring real benefits, provided that a well-performing model is applied. The main motivating idea is that if we could predict the location of faults within the system, we could plan defect detection activities more efficiently, concentrating detection effort and resources on the critical locations rather than on the entire system. Numerous studies have already sought the best general software defect prediction model [Hall et al. 2012]; unfortunately, a well-performing solution is still absent. Defect prediction data are very complex and, in general, do not follow any particular probability distribution that could provide a mathematical model. The data distributions are highly skewed, which gives rise to the well-known data imbalance problem and makes standard machine learning approaches inadequate. Consequently, significant research has recently been devoted to coping with this problem.
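To make the imbalance problem concrete, below is a minimal sketch on synthetic data (the dataset shape, the roughly 9:1 class ratio, and the choice of logistic regression are illustrative assumptions, not values taken from the studies cited above). It trains a plain classifier and a class-weighted variant on data where about one module in ten is "defective"; on such skewed data, overall accuracy typically looks strong even when recall on the minority (defective) class is poor.

```python
# Minimal sketch of the class-imbalance problem in defect prediction.
# All dataset parameters are illustrative assumptions, not values from
# the cited studies.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic "software module metrics": class 1 (defective) is ~10% of samples.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A standard learner implicitly optimizes overall accuracy, which the
# majority (non-defective) class dominates.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Class weighting is one common mitigation; resampling schemes such as
# SMOTE or random under-sampling are others.
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

for name, model in [("plain", plain), ("class-weighted", weighted)]:
    pred = model.predict(X_te)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:>14}: accuracy={accuracy_score(y_te, pred):.2f}  "
          f"defective-class recall={recall_score(y_te, pred):.2f}  "
          f"AUC={auc:.2f}")
```

Because plain accuracy rewards the trivial majority-class prediction, class-distribution-insensitive measures such as the area under the ROC curve [Fawcett 2006] are the usual evaluation choice in this setting.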

[1] Haibo He, et al. Learning from Imbalanced Data, 2009, IEEE Transactions on Knowledge and Data Engineering.

[2] Gary M. Weiss. Mining with rarity: a unifying framework, 2004, SIGKDD Explorations.

[3] Taghi M. Khoshgoftaar, et al. Experimental perspectives on learning from imbalanced data, 2007, ICML '07.

[4] Claes Wohlin, et al. A Classification Scheme for Studies on Fault-Prone Components, 2001, PROFES.

[5] Bruce Christianson, et al. Reflections on the NASA MDP data sets, 2012, IET Software.

[6] Bojana Dalbelo Basic, et al. Multivariate logistic regression prediction of fault-proneness in software modules, 2012, Proceedings of the 35th International Convention MIPRO.

[7] Bruce Christianson, et al. The misuse of the NASA metrics data program data sets for automated software defect prediction, 2011, EASE.

[8] Martin J. Shepperd, et al. Comparing Software Prediction Techniques Using Simulation, 2001, IEEE Transactions on Software Engineering.

[9] Victor R. Basili, et al. A Validation of Object-Oriented Design Metrics as Quality Indicators, 1996, IEEE Transactions on Software Engineering.

[10] Norman E. Fenton, et al. Quantitative Analysis of Faults and Failures in a Complex Software System, 2000, IEEE Transactions on Software Engineering.

[11] Per Runeson, et al. A Replicated Quantitative Analysis of Fault Distributions in Complex Software Systems, 2007, IEEE Transactions on Software Engineering.

[12] Yue Jiang, et al. Techniques for evaluating fault prediction models, 2008, Empirical Software Engineering.

[13] Huan Liu, et al. Toward integrating feature selection algorithms for classification and clustering, 2005, IEEE Transactions on Knowledge and Data Engineering.

[14] Anas N. Al-Rabadi, et al. A comparison of modified reconstructability analysis and Ashenhurst-Curtis decomposition of Boolean functions, 2004.

[15] Taghi M. Khoshgoftaar, et al. An Empirical Study on the Stability of Feature Selection for Imbalanced Software Engineering Data, 2012, 11th International Conference on Machine Learning and Applications.

[16] Andy Brooks, et al. Meta Analysis—A Silver Bullet—for Meta-Analysts, 1997, Empirical Software Engineering.

[17] Taghi M. Khoshgoftaar, et al. Detection of fault-prone software modules during a spiral life cycle, 1996, Proceedings of the International Conference on Software Maintenance.

[18] Taghi M. Khoshgoftaar, et al. Software Defect Prediction for High-Dimensional and Class-Imbalanced Data, 2011, SEKE.

[19] Harald C. Gall, et al. Comparing fine-grained source code changes and code churn for bug prediction, 2011, MSR '11.

[20] Lionel C. Briand, et al. A Comprehensive Empirical Validation of Product Measures for Object-Oriented Systems, 1998.

[21] Catherine Stringfellow, et al. Quantitative Analysis of Development Defects to Guide Testing: A Case Study, 2001, Software Quality Journal.

[22] Taghi M. Khoshgoftaar, et al. Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction, 2010, 22nd IEEE International Conference on Tools with Artificial Intelligence.

[23] Arvinder Kaur, et al. Empirical study of Software Quality estimation, 2012, CCSEIT '12.

[24] Taghi M. Khoshgoftaar, et al. Detecting noisy instances with the rule-based classification model, 2005, Intelligent Data Analysis.

[25] Gustavo E. A. P. A. Batista, et al. A study of the behavior of several methods for balancing machine learning training data, 2004, SIGKDD Explorations.

[26] Per Runeson, et al. A Second Replicated Quantitative Analysis of Fault Distributions in Complex Software Systems, 2007, IEEE Transactions on Software Engineering.

[27] Tom Fawcett, et al. An introduction to ROC analysis, 2006, Pattern Recognition Letters.

[28] Atul Gupta, et al. Investigating fault prediction capabilities of five prediction models for software quality, 2012, SAC '12.

[29] Robert Tibshirani, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2001, Springer Series in Statistics.

[30] Akito Monden, et al. The Effects of Over and Under Sampling on Fault-prone Module Detection, 2007, First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007).

[31] Nachiappan Nagappan, et al. Predicting defects using network analysis on dependency graphs, 2008, ACM/IEEE 30th International Conference on Software Engineering.

[32] R. Suganya, et al. Data Mining Concepts and Techniques, 2010.

[33] Lionel C. Briand, et al. A comprehensive empirical validation of design measures for object-oriented systems, 1998, Proceedings of the Fifth International Software Metrics Symposium.

[34] Qinbao Song, et al. Data Quality: Some Comments on the NASA Software Defect Datasets, 2013, IEEE Transactions on Software Engineering.

[35] Tracy Hall, et al. A Systematic Literature Review on Fault Prediction Performance in Software Engineering, 2012, IEEE Transactions on Software Engineering.

[36] Taghi M. Khoshgoftaar, et al. Comparative Assessment of Software Quality Classification Techniques: An Empirical Case Study, 2004, Empirical Software Engineering.

[37] Ming Zhao, et al. Application of multivariate analysis for software fault prediction, 1998, Software Quality Journal.

[38] Anil K. Jain, et al. Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, 1991, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39] Xin Yao, et al. Using Class Imbalance Learning for Software Defect Prediction, 2013, IEEE Transactions on Reliability.

[40] Foster Provost. Machine Learning from Imbalanced Data Sets 101, 2008.

[41] Bart Baesens, et al. Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings, 2008, IEEE Transactions on Software Engineering.