A Systematic Study of Cross-Project Defect Prediction With Meta-Learning

Cross-Project Defect Prediction (CPDP) is the prediction of defects in a target project using data from external projects. Several methods have been proposed to improve the predictive performance of CPDP models, but comparisons among state-of-the-art methods are lacking. Moreover, previous work has shown that the most suitable method varies with the project being predicted, which makes choosing a method difficult. We provide an extensive experimental comparison of 31 CPDP methods derived from state-of-the-art approaches, applied to 47 versions of 15 open-source software projects. Four methods stood out as performing best across datasets. However, the most suitable among these four still varies with the project being predicted. We therefore propose and evaluate a meta-learning solution designed to automatically select and recommend the most suitable CPDP method for a project. Our results show that the meta-learning solution is able to learn from previous experience and recommend suitable methods dynamically. Compared with the base methods, however, the proposed solution showed only a minor difference in performance. These results provide valuable knowledge about the possibilities and limitations of meta-learning applied to CPDP.
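To illustrate the general idea of recommending a method from previous experience, the following is a minimal sketch. It does not reproduce the meta-learner evaluated in the study. The meta-features, their values, the method names, and the nearest-neighbour rule are all assumptions chosen for illustration.

```python
from math import dist

# Hypothetical meta-dataset: each previously evaluated project is described
# by two scaled meta-features and labeled with the CPDP method that performed
# best on it. All values and method names here are illustrative assumptions.
META_EXAMPLES = [
    ((0.10, 0.20), "method_A"),
    ((0.12, 0.25), "method_A"),
    ((0.45, 0.90), "method_B"),
    ((0.50, 1.10), "method_B"),
]


def recommend_method(meta_features, examples=META_EXAMPLES):
    """Recommend the method that was best on the most similar past project.

    Uses a 1-nearest-neighbour rule with Euclidean distance, which is an
    assumed stand-in for whatever meta-learner a real system would train.
    """
    _, best_label = min(examples, key=lambda ex: dist(ex[0], meta_features))
    return best_label


# A new target project is characterized by the same meta-features.
print(recommend_method((0.11, 0.22)))
```

In this sketch, adding a newly evaluated project to `META_EXAMPLES` is what lets later recommendations draw on accumulated experience.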
