Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction

Cross-project defect prediction (CPDP) refers to predicting defects in a target project using prediction models trained from historical data of other source projects. And CPDP in the scenario where source and target projects have different metric sets is called heterogeneous defect prediction (HDP). Recently, HDP has received much research interest. Existing HDP methods only consider the linear correlation relationship among the features (metrics) of the source and target projects, and such models are insufficient to evaluate nonlinear correlation relationship among the features. So these methods may suffer from the linearly inseparable problem in the linear feature space. Furthermore, existing HDP methods do not take the class imbalance problem into consideration. Unfortunately, the imbalanced nature of software defect datasets increases the learning difficulty for the predictors. In this paper, we propose a new cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) approach for HDP. CTKCCA can not only make the data distributions of source and target projects much more similar in the nonlinear feature space, where the learned features have favorable separability, but also utilize the different misclassification costs for defective and defect-free classes to alleviate the class imbalance problem. We perform the Friedman test with Nemenyi’s post-hoc statistical test and the Cliff’s delta effect size test for the evaluation. Extensive experiments on 28 public projects from five data sources indicate that: (1) CTKCCA significantly performs better than the related CPDP methods; (2) CTKCCA performs better than the related state-of-the-art HDP methods.

[1]  Josef Kittler,et al.  Discriminative Learning and Recognition of Image Set Classes Using Canonical Correlations , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Michele Lanza,et al.  Evaluating defect prediction approaches: a benchmark and an extensive comparison , 2011, Empirical Software Engineering.

[3]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[4]  Colin Fyfe,et al.  Kernel and Nonlinear Canonical Correlation Analysis , 2000, IJCNN.

[5]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[6]  Aleix M. Martínez,et al.  Kernel Optimization in Discriminant Analysis , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[8]  Andrea De Lucia,et al.  Cross-project defect prediction models: L'Union fait la force , 2014, 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE).

[9]  Yue Jiang,et al.  Misclassification cost-sensitive fault prediction models , 2009, PROMISE '09.

[10]  Premkumar T. Devanbu,et al.  Recalling the "imprecision" of cross-project defect prediction , 2012, SIGSOFT FSE.

[11]  Karthikeyan Natesan Ramamurthy,et al.  Multiple Kernel Sparse Representations for Supervised and Unsupervised Learning , 2013, IEEE Transactions on Image Processing.

[12]  Jongmoon Baik,et al.  Value-cognitive boosting with a support vector machine for cross-project defect prediction , 2014, Empirical Software Engineering.

[13]  Naoyasu Ubayashi,et al.  Studying just-in-time defect prediction using cross-project models , 2015, Empirical Software Engineering.

[14]  Daoqiang Zhang,et al.  Two-Stage Cost-Sensitive Learning for Software Defect Prediction , 2014, IEEE Transactions on Reliability.

[15]  Ye Yang,et al.  An investigation on the feasibility of cross-project defect prediction , 2012, Automated Software Engineering.

[16]  Tian Jiang,et al.  Personalized defect prediction , 2013, 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[17]  Harald C. Gall,et al.  Cross-project defect prediction: a large scale experiment on data vs. domain vs. process , 2009, ESEC/SIGSOFT FSE.

[18]  Audris Mockus,et al.  Towards building a universal defect prediction model with rank transformed predictors , 2016, Empirical Software Engineering.

[19]  Audris Mockus,et al.  A large-scale empirical study of just-in-time quality assurance , 2013, IEEE Transactions on Software Engineering.

[20]  Gerardo Canfora,et al.  Defect prediction as a multiobjective optimization problem , 2015, Softw. Test. Verification Reliab..

[21]  Xiao-Yuan Jing,et al.  Software defect prediction based on collaborative representation classification , 2014, ICSE Companion.

[22]  Yue Jiang,et al.  Techniques for evaluating fault prediction models , 2008, Empirical Software Engineering.

[23]  Rongxin Wu,et al.  Dealing with noise in defect prediction , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[24]  Tim Menzies,et al.  Heterogeneous Defect Prediction , 2015, IEEE Transactions on Software Engineering.

[25]  Jaechang Nam,et al.  CLAMI: Defect Prediction on Unlabeled Datasets , 2015, ASE 2015.

[26]  Rongxin Wu,et al.  ReLink: recovering links between bugs and changes , 2011, ESEC/FSE '11.

[27]  Rongrong Ji,et al.  Supervised hashing with kernels , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Tracy Hall,et al.  Researcher Bias: The Use of Machine Learning in Software Defect Prediction , 2014, IEEE Transactions on Software Engineering.

[29]  Qinbao Song,et al.  Using Coding-Based Ensemble Learning to Improve Software Defect Prediction , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[30]  Vince D. Calhoun,et al.  Joint Blind Source Separation by Multiset Canonical Correlation Analysis , 2009, IEEE Transactions on Signal Processing.

[31]  Ayse Basar Bener,et al.  Empirical evaluation of the effects of mixed project data on learning defect predictors , 2013, Inf. Softw. Technol..

[32]  Ayse Basar Bener,et al.  On the relative value of cross-company and within-company data for defect prediction , 2009, Empirical Software Engineering.

[33]  Steven Van Vaerenbergh,et al.  Kernel Methods for Nonlinear Identification, Equalization and Separation of Signals , 2010 .

[34]  Sunghun Kim,et al.  Reducing Features to Improve Code Change-Based Bug Prediction , 2013, IEEE Transactions on Software Engineering.

[35]  Brian C. Lovell,et al.  Unsupervised Domain Adaptation by Domain Invariant Projection , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Burak Turhan,et al.  Implications of ceiling effects in defect predictors , 2008, PROMISE '08.

[37]  Tim Menzies,et al.  Balancing Privacy and Utility in Cross-Company Defect Prediction , 2013, IEEE Transactions on Software Engineering.

[38]  Tim Menzies,et al.  Learning from Open-Source Projects: An Empirical Study on Defect Prediction , 2013, 2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement.

[39]  Tracy Hall,et al.  A Systematic Literature Review on Fault Prediction Performance in Software Engineering , 2012, IEEE Transactions on Software Engineering.

[40]  N. Cliff Ordinal methods for behavioral data analysis , 1996 .

[41]  Lionel C. Briand,et al.  Assessing the Applicability of Fault-Proneness Models Across Object-Oriented Software Projects , 2002, IEEE Trans. Software Eng..

[42]  Hao Chen,et al.  Kernel Based Asymmetric Learning for Software Defect Prediction , 2012, IEICE Trans. Inf. Syst..

[43]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[44]  Xiao-Yuan Jing,et al.  Intra-View and Inter-View Supervised Correlation Analysis for Multi-View Feature Learning , 2014, AAAI.

[45]  Ayse Basar Bener,et al.  Defect prediction from static code features: current results, limitations, new approaches , 2010, Automated Software Engineering.

[46]  Banu Diri,et al.  Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem , 2009, Inf. Sci..

[47]  Guangming Shi,et al.  Visual Orientation Selectivity Based Structure Description , 2015, IEEE Transactions on Image Processing.

[48]  Jongmoon Baik,et al.  A transfer cost-sensitive boosting approach for cross-project defect prediction , 2017, Software Quality Journal.

[49]  Xiao-Yuan Jing,et al.  Label propagation based semi-supervised learning for software defect prediction , 2016, Automated Software Engineering.

[50]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[51]  Jaechang Nam,et al.  CLAMI: Defect Prediction on Unlabeled Datasets (T) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[52]  Ayse Basar Bener,et al.  Practical considerations in deploying statistical methods for defect prediction: A case study within the Turkish telecommunications industry , 2010, Inf. Softw. Technol..

[53]  David Lo,et al.  Cross-project build co-change prediction , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[54]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[55]  YU Chun-mei Kernel independent component analysis for discrimination , 2011 .

[56]  Jin Liu,et al.  Dictionary learning based software defect prediction , 2014, ICSE.

[57]  Yue Jiang,et al.  Cost Curve Evaluation of Fault Prediction Models , 2008, 2008 19th International Symposium on Software Reliability Engineering (ISSRE).

[58]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[59]  Baowen Xu,et al.  Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning , 2015, ESEC/SIGSOFT FSE.

[60]  Xin Yao,et al.  Using Class Imbalance Learning for Software Defect Prediction , 2013, IEEE Transactions on Reliability.

[61]  Lionel C. Briand,et al.  A systematic and comprehensive investigation of methods to build and evaluate fault prediction models , 2010, J. Syst. Softw..

[62]  Bart Baesens,et al.  Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings , 2008, IEEE Transactions on Software Engineering.

[63]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007 .

[64]  Jiwen Lu,et al.  Cost-Sensitive Subspace Analysis and Extensions for Face Recognition , 2013, IEEE Transactions on Information Forensics and Security.

[65]  David J. Kriegman,et al.  Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection , 1996, ECCV.

[66]  Sashank Dara,et al.  Online Defect Prediction for Imbalanced Data , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[67]  Thomas J. Ostrand,et al.  \{PROMISE\} Repository of empirical software engineering data , 2007 .

[68]  Haruhiko Kaiya,et al.  Adapting a fault prediction model to allow inter languagereuse , 2008, PROMISE '08.

[69]  Song Wang,et al.  Automatically Learning Semantic Features for Defect Prediction , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[70]  Ying Ma,et al.  On Software Defect Prediction Using Machine Learning , 2014, J. Appl. Math..

[71]  Taghi M. Khoshgoftaar,et al.  Improving Software-Quality Predictions With Data Sampling and Boosting , 2009, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[72]  Taghi M. Khoshgoftaar,et al.  Choosing software metrics for defect prediction: an investigation on feature selection techniques , 2011, Softw. Pract. Exp..

[73]  Hoh Peter In,et al.  Developer Micro Interaction Metrics for Software Defect Prediction , 2016, IEEE Transactions on Software Engineering.

[74]  Ken-ichi Matsumoto,et al.  The Impact of Mislabelling on the Performance and Interpretation of Defect Prediction Models , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[75]  David Lo,et al.  HYDRA: Massively Compositional Model for Cross-Project Defect Prediction , 2016, IEEE Transactions on Software Engineering.

[76]  Boris S. Mordukhovich,et al.  Multiobjective optimization problems with equilibrium constraints , 2008, Math. Program..

[77]  Koichiro Ochimizu,et al.  Towards logistic regression models for predicting fault-prone code across software projects , 2009, ESEM 2009.

[78]  Ying Zou,et al.  Cross-Project Defect Prediction Using a Connectivity-Based Unsupervised Classifier , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[79]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.

[80]  Guangchun Luo,et al.  Transfer learning for cross-company software defect prediction , 2012, Inf. Softw. Technol..

[81]  Zhi-Hua Zhou,et al.  Sample-based software defect prediction with active and semi-supervised learning , 2012, Automated Software Engineering.

[82]  Sinno Jialin Pan,et al.  Transfer defect learning , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[83]  Audris Mockus,et al.  Towards building a universal defect prediction model , 2014, MSR 2014.

[84]  Xiao Liu,et al.  An empirical study on software defect prediction with a simplified metric set , 2014, Inf. Softw. Technol..

[85]  Xiuzhen Zhang,et al.  Comments on "Data Mining Static Code Attributes to Learn Defect Predictors" , 2007, IEEE Trans. Software Eng..

[86]  Steffen Herbold,et al.  Training data selection for cross-project defect prediction , 2013, PROMISE.

[87]  Qinbao Song,et al.  Data Quality: Some Comments on the NASA Software Defect Datasets , 2013, IEEE Transactions on Software Engineering.

[88]  Scott Dick,et al.  Evaluating Stratification Alternatives to Improve Software Defect Prediction , 2012, IEEE Transactions on Reliability.

[89]  Xiaoyuan Jing,et al.  Multiple kernel ensemble learning for software defect prediction , 2015, Automated Software Engineering.

[90]  B. Thompson Canonical Correlation Analysis: Uses and Interpretation , 1984 .

[91]  Taghi M. Khoshgoftaar,et al.  Cost-sensitive boosting in software quality modeling , 2002, 7th IEEE International Symposium on High Assurance Systems Engineering, 2002. Proceedings..

[92]  Forrest Shull,et al.  Local versus Global Lessons for Defect Prediction and Effort Estimation , 2013, IEEE Transactions on Software Engineering.

[93]  Yunde Jia,et al.  Cross-View Action Recognition over Heterogeneous Feature Spaces , 2013, 2013 IEEE International Conference on Computer Vision.

[94]  Shane McIntosh,et al.  Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[95]  Karim O. Elish,et al.  Predicting defect-prone software modules using support vector machines , 2008, J. Syst. Softw..

[96]  Zhongzhi Shi,et al.  Classification of big velocity data via cross-domain Canonical Correlation Analysis , 2013, 2013 IEEE International Conference on Big Data.

[97]  Michael I. Jordan,et al.  Kernel independent component analysis , 2003 .

[98]  Jun Zheng,et al.  Cost-sensitive boosting neural networks for software defect prediction , 2010, Expert Syst. Appl..

[99]  Zhaowei Shang,et al.  Negative samples reduction in cross-company software defects prediction , 2015, Inf. Softw. Technol..

[100]  Yu-Chiang Frank Wang,et al.  Heterogeneous Domain Adaptation and Classification by Exploiting the Correlation Subspace , 2014, IEEE Transactions on Image Processing.

[101]  Bernhard Schölkopf,et al.  Use of the Zero-Norm with Linear Models and Kernel Methods , 2003, J. Mach. Learn. Res..

[102]  Paulo J. L. Adeodato,et al.  Predicting software defects: A cost-sensitive approach , 2011, 2011 IEEE International Conference on Systems, Man, and Cybernetics.

[103]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[104]  Tim Menzies,et al.  Better cross company defect prediction , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[105]  Lucas Layman,et al.  LACE2: Better Privacy-Preserving Data Sharing for Cross Project Defect Prediction , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[106]  Shane McIntosh,et al.  Automated Parameter Optimization of Classification Techniques for Defect Prediction Models , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).