Cross-Project and Within-Project Semisupervised Software Defect Prediction: A Unified Approach

When there are not enough historical defect data to build an accurate prediction model, semisupervised defect prediction (SSDP) and cross-project defect prediction (CPDP) are two feasible solutions. Existing CPDP methods assume that the available source data are well labeled. However, because labeling a large amount of defect data requires expensive human effort, often only suitable unlabeled source data are available. We call CPDP in this scenario cross-project semisupervised defect prediction (CSDP). Although some within-project semisupervised defect prediction (WSDP) methods have been developed in recent years, there is still much room for improvement in prediction performance. In this paper, we aim to provide a unified and effective solution for both the CSDP and WSDP problems. We introduce the semisupervised dictionary learning technique and propose a cost-sensitive kernelized semisupervised dictionary learning (CKSDL) approach. CKSDL can make full use of the limited labeled defect data and a large amount of unlabeled data in the kernel space. In addition, CKSDL takes misclassification costs into account in the dictionary learning process. Extensive experiments on 16 projects indicate that (1) CKSDL outperforms state-of-the-art WSDP methods, (2) using unlabeled cross-project defect data can help improve WSDP performance, and (3) CKSDL generally obtains significantly better prediction performance than related SSDP methods in the CSDP scenario.
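To make the idea of misclassification costs concrete, the following minimal sketch shows a generic minimum-expected-cost decision rule for defect prediction. It is not the CKSDL objective or any part of the authors' method; the function names and the cost values (a missed defect costing five times a false alarm) are assumptions chosen only for illustration.

```python
# Illustrative sketch only: a generic cost-sensitive decision rule,
# NOT the CKSDL dictionary learning procedure described in the abstract.
# The cost values below are hypothetical.

def min_cost_label(p_defective, cost_fn=5.0, cost_fp=1.0):
    """Return 1 (defective) or 0 (clean), picking the label with lower expected cost.

    p_defective: estimated probability that the module is defective.
    cost_fn: assumed cost of labeling a defective module as clean (missed defect).
    cost_fp: assumed cost of labeling a clean module as defective (false alarm).
    """
    expected_cost_if_clean = p_defective * cost_fn
    expected_cost_if_defective = (1.0 - p_defective) * cost_fp
    return 1 if expected_cost_if_defective < expected_cost_if_clean else 0


def total_cost(y_true, y_pred, cost_fn=5.0, cost_fp=1.0):
    """Sum the misclassification costs over paired true and predicted labels."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += cost_fn  # missed defect
        elif truth == 0 and pred == 1:
            cost += cost_fp  # false alarm
    return cost


if __name__ == "__main__":
    probabilities = [0.05, 0.10, 0.30, 0.80]
    print([min_cost_label(p) for p in probabilities])
    print(total_cost([1, 0, 1, 0], [0, 1, 1, 0]))
```

With these assumed costs, a module is flagged as defective once its estimated defect probability exceeds 1/6 rather than 1/2, which reflects the usual motivation for cost-sensitive learning on defect data: missing a defect is typically treated as more expensive than raising a false alarm.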
