Misclassification Cost-Sensitive Software Defect Prediction

Software defect prediction helps developers focus on defective modules for efficient software quality assurance. A common goal shared by existing software defect prediction methods is to attain low classification error rates. These proposals suffer from two practical problems: (i) Most of the prediction methods rely on a large number of labeled training data. However, collecting labeled data is a difficult and expensive task. It is hard to obtain classification labels over new software projects or existing projects without historical defect data. (ii) Software defect datasets are highly imbalanced. In many real-world applications, the misclassification cost of defective modules is generally several times higher than that of non-defective ones. In this paper, we present a misclassification Cost-sensitive approach to Software Defect Prediction (CSDP). The CSDP approach is novel in two aspects: First, CSDP addresses the problem of unlabeled software detect datasets by combining an unsupervised sampling method with a domain specific misclassification cost model. This preprocessing step selectively samples a small percentage of modules through estimating their classification labels. Second, CSDP builds a cost-sensitive support vector machine model to predict defect-proneness of the rest of modules with both overall classification error rate and domain specific misclassification cost as quality metrics. CSDP is evaluated on four NASA projects. Experimental results highlight three interesting observations: (1) CSDP achieves higher Normalized Expected Cost of Misclassification (NECM) compared with state-of-art supervised learning models under imbalanced training data with limited labeling. (2) CSDP outperforms state-of-art semi-supervised learning methods, which disregards classification costs, especially in recall rate. (3) CSDP enhanced through unsupervised sampling as a preprocessing step prior to training and prediction outperforms the baseline CSDP without the sampling process.

[1]  Ying Zou,et al.  Cross-Project Defect Prediction Using a Connectivity-Based Unsupervised Classifier , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[2]  Qing Li,et al.  Three-way decisions based software defect prediction , 2016, Knowl. Based Syst..

[3]  Qinbao Song,et al.  Software defect association mining and defect correction effort prediction , 2006, IEEE Transactions on Software Engineering.

[4]  Daoqiang Zhang,et al.  Two-Stage Cost-Sensitive Learning for Software Defect Prediction , 2014, IEEE Transactions on Reliability.

[5]  Shunzhi Zhu,et al.  An improved semi-supervised learning method for software defect prediction , 2014, J. Intell. Fuzzy Syst..

[6]  Zhi-Hua Zhou,et al.  Sample-based software defect prediction with active and semi-supervised learning , 2012, Automated Software Engineering.

[7]  Akito Monden,et al.  The Effects of Over and Under Sampling on Fault-prone Module Detection , 2007, First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007).

[8]  Qinbao Song,et al.  Using Coding-Based Ensemble Learning to Improve Software Defect Prediction , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[9]  Taghi M. Khoshgoftaar,et al.  Tree-based software quality estimation models for fault prediction , 2002, Proceedings Eighth IEEE Symposium on Software Metrics.

[10]  Cagatay Catal,et al.  Software fault prediction: A literature review and current trends , 2011, Expert Syst. Appl..

[11]  Bojan Cukic,et al.  An iterative semi-supervised approach to software fault prediction , 2011, Promise '11.

[12]  Zhi-Hua Zhou,et al.  Towards Making Unlabeled Data Never Hurt , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Taghi M. Khoshgoftaar,et al.  Software quality estimation with limited fault data: a semi-supervised learning perspective , 2007, Software Quality Journal.

[14]  Jun Zheng,et al.  Cost-sensitive boosting neural networks for software defect prediction , 2010, Expert Syst. Appl..

[15]  Md Zahidul Islam,et al.  Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem , 2015, Inf. Syst..

[16]  Zhi-Hua Zhou,et al.  Semi-supervised learning using label mean , 2009, ICML '09.

[17]  Taghi M. Khoshgoftaar,et al.  Choosing software metrics for defect prediction: an investigation on feature selection techniques , 2011, Softw. Pract. Exp..

[18]  Jaechang Nam,et al.  CLAMI: Defect Prediction on Unlabeled Datasets (T) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[19]  Zhi-Hua Zhou,et al.  Software Defect Detection with Rocus , 2011, Journal of Computer Science and Technology.

[20]  Lars Schmidt-Thieme,et al.  Cost-sensitive learning methods for imbalanced data , 2010, The 2010 International Joint Conference on Neural Networks (IJCNN).

[21]  Letha H. Etzkorn,et al.  Empirical Validation of Three Software Metrics Suites to Predict Fault-Proneness of Object-Oriented Classes Developed Using Highly Iterative or Agile Software Development Processes , 2007, IEEE Transactions on Software Engineering.

[22]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[23]  Bojan Cukic,et al.  Software defect prediction using semi-supervised learning with dimension reduction , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[24]  Zhi-Hua Zhou,et al.  Cost-Sensitive Semi-Supervised Support Vector Machine , 2010, AAAI.

[25]  Taghi M. Khoshgoftaar,et al.  Unsupervised learning for expert-based software quality estimation , 2004, Eighth IEEE International Symposium on High Assurance Systems Engineering, 2004. Proceedings..

[26]  Michael R. Lyu,et al.  A novel method for early software quality prediction based on support vector machine , 2005, 16th IEEE International Symposium on Software Reliability Engineering (ISSRE'05).

[27]  Jaechang Nam,et al.  CLAMI: Defect Prediction on Unlabeled Datasets , 2015, ASE 2015.