An Improved SDA Based Defect Prediction Framework for Both Within-Project and Cross-Project Class-Imbalance Problems

Background. Solving the class-imbalance problem of within-project software defect prediction (SDP) is an important research topic. Although some class-imbalance learning methods have been presented, there exists room for improvement. For cross-project SDP, we found that the class-imbalanced source usually leads to misclassification of defective instances. However, only one work has paid attention to this cross-project class-imbalance problem. Objective. We aim to provide effective solutions for both within-project and cross-project class-imbalance problems. Method. Subclass discriminant analysis (SDA), an effective feature learning method, is introduced to solve the problems. It can learn features with more powerful classification ability from original metrics. For within-project prediction, we improve SDA for achieving balanced subclasses and propose the improved SDA (ISDA) approach. For cross-project prediction, we employ the semi-supervised transfer component analysis (SSTCA) method to make the distributions of source and target data consistent, and propose the SSTCA+ISDA prediction approach. Results. Extensive experiments on four widely used datasets indicate that: 1) ISDA-based solution performs better than other state-of-the-art methods for within-project class-imbalance problem; 2) SSTCA+ISDA proposed for cross-project class-imbalance problem significantly outperforms related methods. Conclusion. Within-project and cross-project class-imbalance problems greatly affect prediction performance, and we provide a unified and effective prediction framework for both problems.

[1]  Andrea De Lucia,et al.  Cross-project defect prediction models: L'Union fait la force , 2014, 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE).

[2]  Sashank Dara,et al.  Online Defect Prediction for Imbalanced Data , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[3]  Koichiro Ochimizu,et al.  Towards logistic regression models for predicting fault-prone code across software projects , 2009, 2009 3rd International Symposium on Empirical Software Engineering and Measurement.

[4]  Naoyasu Ubayashi,et al.  An empirical study of just-in-time defect prediction using cross-project models , 2014, MSR 2014.

[5]  Francisco Herrera,et al.  On the use of MapReduce for imbalanced big data using Random Forest , 2014, Inf. Sci..

[6]  Premkumar T. Devanbu,et al.  Comparing static bug finders and statistical prediction , 2014, ICSE.

[7]  Michele Lanza,et al.  Evaluating defect prediction approaches: a benchmark and an extensive comparison , 2011, Empirical Software Engineering.

[8]  Dimitris Kanellopoulos,et al.  Data Preprocessing for Supervised Leaning , 2007 .

[9]  Paulo J. L. Adeodato,et al.  Predicting software defects: A cost-sensitive approach , 2011, 2011 IEEE International Conference on Systems, Man, and Cybernetics.

[10]  Ye Yang,et al.  An investigation on the feasibility of cross-project defect prediction , 2012, Automated Software Engineering.

[11]  Jun Wang,et al.  Compressed C4.5 Models for Software Defect Prediction , 2012, 2012 12th International Conference on Quality Software.

[12]  Thomas J. Ostrand,et al.  \{PROMISE\} Repository of empirical software engineering data , 2007 .

[13]  Murali Krishna Ramanathan,et al.  Scalable and incremental software bug detection , 2013, ESEC/FSE 2013.

[14]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[15]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[16]  Nachiappan Nagappan,et al.  Predicting defects using network analysis on dependency graphs , 2008, 2008 ACM/IEEE 30th International Conference on Software Engineering.

[17]  Jin Liu,et al.  Dictionary learning based software defect prediction , 2014, ICSE.

[18]  Xinli Yang,et al.  Deep Learning for Just-in-Time Defect Prediction , 2015, 2015 IEEE International Conference on Software Quality, Reliability and Security.

[19]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[20]  O. J. Dunn Multiple Comparisons among Means , 1961 .

[21]  Shane McIntosh,et al.  Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[22]  Karim O. Elish,et al.  Predicting defect-prone software modules using support vector machines , 2008, J. Syst. Softw..

[23]  Jun Zheng,et al.  Cost-sensitive boosting neural networks for software defect prediction , 2010, Expert Syst. Appl..

[24]  Audris Mockus,et al.  Towards building a universal defect prediction model , 2014, MSR 2014.

[25]  Aleix M. Martínez,et al.  Subclass discriminant analysis , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Tong-Seng Quah,et al.  Application of neural networks for software quality prediction using object-oriented metrics , 2003, International Conference on Software Maintenance, 2003. ICSM 2003. Proceedings..

[27]  Forrest Shull,et al.  Local versus Global Lessons for Defect Prediction and Effort Estimation , 2013, IEEE Transactions on Software Engineering.

[28]  Qinbao Song,et al.  Data Quality: Some Comments on the NASA Software Defect Datasets , 2013, IEEE Transactions on Software Engineering.

[29]  Yue Jiang,et al.  Techniques for evaluating fault prediction models , 2008, Empirical Software Engineering.

[30]  Rongxin Wu,et al.  ReLink: recovering links between bugs and changes , 2011, ESEC/FSE '11.

[31]  Tracy Hall,et al.  Researcher Bias: The Use of Machine Learning in Software Defect Prediction , 2014, IEEE Transactions on Software Engineering.

[32]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[33]  Xin Yao,et al.  How to make best use of cross-company data in software effort estimation? , 2014, ICSE.

[34]  Haruhiko Kaiya,et al.  Adapting a fault prediction model to allow inter languagereuse , 2008, PROMISE '08.

[35]  Tim Menzies,et al.  Software Analytics: So What? , 2013, IEEE Softw..

[36]  Tim Menzies,et al.  Learning from Open-Source Projects: An Empirical Study on Defect Prediction , 2013, 2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement.

[37]  Lucas Layman,et al.  LACE2: Better Privacy-Preserving Data Sharing for Cross Project Defect Prediction , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[38]  Alfredo Petrosino,et al.  Adjusted F-measure and kernel scaling for imbalanced data learning , 2014, Inf. Sci..

[39]  Ying Ma,et al.  On Software Defect Prediction Using Machine Learning , 2014, J. Appl. Math..

[40]  Michael M. Richter,et al.  Defect Prediction using Case-Based Reasoning: an Attribute Weighting Technique Based upon sensitivity Analysis in Neural Networks , 2012, Int. J. Softw. Eng. Knowl. Eng..

[41]  Guangchun Luo,et al.  Transfer learning for cross-company software defect prediction , 2012, Inf. Softw. Technol..

[42]  Premkumar T. Devanbu,et al.  Sample size vs. bias in defect prediction , 2013, ESEC/FSE 2013.

[43]  S. Nickolas,et al.  Feature Selection Using Decision Tree Induction in Class level Metrics Dataset for Software Defect Predictions , 2010 .

[44]  Sinno Jialin Pan,et al.  Transfer defect learning , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[45]  Tim Menzies,et al.  Transfer learning in effort estimation , 2015, Empirical Software Engineering.

[46]  Burak Turhan,et al.  Implications of ceiling effects in defect predictors , 2008, PROMISE '08.

[47]  Qiang Yang,et al.  Cross-domain sentiment classification via spectral feature alignment , 2010, WWW '10.

[48]  Yue Jiang,et al.  Cost Curve Evaluation of Fault Prediction Models , 2008, 2008 19th International Symposium on Software Reliability Engineering (ISSRE).

[49]  Jongmoon Baik,et al.  Value-cognitive boosting with a support vector machine for cross-project defect prediction , 2014, Empirical Software Engineering.

[50]  Baowen Xu,et al.  Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning , 2015, ESEC/SIGSOFT FSE.

[51]  Xin Yao,et al.  Using Class Imbalance Learning for Software Defect Prediction , 2013, IEEE Transactions on Reliability.

[52]  Taghi M. Khoshgoftaar,et al.  The use of decision trees for cost‐sensitive classification: an empirical study in software quality prediction , 2011, Wiley Interdiscip. Rev. Data Min. Knowl. Discov..

[53]  Taghi M. Khoshgoftaar,et al.  Improving Software-Quality Predictions With Data Sampling and Boosting , 2009, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[54]  Reid Holmes,et al.  Coverage is not strongly correlated with test suite effectiveness , 2014, ICSE.

[55]  Naoyasu Ubayashi,et al.  Studying just-in-time defect prediction using cross-project models , 2015, Empirical Software Engineering.

[56]  M. Turk,et al.  Eigenfaces for Recognition , 1991, Journal of Cognitive Neuroscience.

[57]  Huanhuan Chen,et al.  Negative correlation learning for classification ensembles , 2010, The 2010 International Joint Conference on Neural Networks (IJCNN).

[58]  N. Cliff Dominance statistics: Ordinal analyses to answer ordinal questions. , 1993 .

[59]  Scott Dick,et al.  Evaluating Stratification Alternatives to Improve Software Defect Prediction , 2012, IEEE Transactions on Reliability.

[60]  Gerardo Canfora,et al.  Multi-objective Cross-Project Defect Prediction , 2013, 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation.

[61]  Bruce Christianson,et al.  Using the Support Vector Machine as a Classification Method for Software Defect Prediction with Static Code Metrics , 2009, EANN.

[62]  Ping Guo,et al.  Software Defect Prediction Using Fuzzy Support Vector Regression , 2010, ISNN.

[63]  Ivor W. Tsang,et al.  Domain Adaptation via Transfer Component Analysis , 2009, IEEE Transactions on Neural Networks.

[64]  María José del Jesús,et al.  A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets , 2008, Fuzzy Sets Syst..

[65]  Harald C. Gall,et al.  Cross-project defect prediction: a large scale experiment on data vs. domain vs. process , 2009, ESEC/SIGSOFT FSE.

[66]  Audris Mockus,et al.  A large-scale empirical study of just-in-time quality assurance , 2013, IEEE Transactions on Software Engineering.

[67]  Tao Wang,et al.  Naive Bayes Software Defect Prediction Model , 2010, 2010 International Conference on Computational Intelligence and Software Engineering.

[68]  Tracy Hall,et al.  A Systematic Literature Review on Fault Prediction Performance in Software Engineering , 2012, IEEE Transactions on Software Engineering.

[69]  Lionel C. Briand,et al.  Assessing the Applicability of Fault-Proneness Models Across Object-Oriented Software Projects , 2002, IEEE Trans. Software Eng..

[70]  Ayse Basar Bener,et al.  On the relative value of cross-company and within-company data for defect prediction , 2009, Empirical Software Engineering.

[71]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[72]  Xin Yao,et al.  A Learning-to-Rank Approach to Software Defect Prediction , 2015, IEEE Transactions on Reliability.

[73]  Ivor W. Tsang,et al.  Domain Transfer SVM for video concept detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[74]  Daoqiang Zhang,et al.  Two-Stage Cost-Sensitive Learning for Software Defect Prediction , 2014, IEEE Transactions on Reliability.

[75]  Qinbao Song,et al.  Using Coding-Based Ensemble Learning to Improve Software Defect Prediction , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[76]  Deng Cai,et al.  Laplacian Score for Feature Selection , 2005, NIPS.

[77]  Qiang Yang,et al.  Transferring Naive Bayes Classifiers for Text Classification , 2007, AAAI.

[78]  Ayse Basar Bener,et al.  Analysis of Naive Bayes' assumptions on software fault data: An empirical study , 2009, Data Knowl. Eng..