A systematic mapping study on cross-project defect prediction

Cross-Project-Defect Prediction as a sub-topic of defect prediction in general has become a popular topic in research. In this article, we present a systematic mapping study with the focus on CPDP, for which we found 50 publications. We summarize the approaches presented by each publication and discuss the case study setups and results. We discovered a great amount of heterogeneity in the way case studies are conducted, because of differences in the data sets, classifiers, performance metrics, and baseline comparisons used. Due to this, we could not compare the results of our review on a qualitative basis, i.e., determine which approaches perform best for CPDP.

[1]  Robert C. Holte,et al.  Cost curves: An improved method for visualizing classifier performance , 2006, Machine Learning.

[2]  Bernd Gärtner,et al.  Understanding and using linear programming , 2007, Universitext.

[3]  Forrest Shull,et al.  Local versus Global Lessons for Defect Prediction and Effort Estimation , 2013, IEEE Transactions on Software Engineering.

[4]  Om Prakash Vyas,et al.  Cross Company and within Company Fault Prediction using Object Oriented Metrics , 2013 .

[5]  Banu Diri,et al.  A systematic review of software fault prediction studies , 2009, Expert Syst. Appl..

[6]  Teuvo Kohonen,et al.  Self-organized formation of topologically correct feature maps , 2004, Biological Cybernetics.

[7]  Robert C. Holte,et al.  C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling , 2003 .

[8]  Franz Wotawa,et al.  Novel Insights on Cross Project Fault Prediction Applied to Automotive Software , 2015, ICTSS.

[9]  Sousuke Amasaki,et al.  Improving Cross-Project Defect Prediction Methods with Data Simplification , 2015, 2015 41st Euromicro Conference on Software Engineering and Advanced Applications.

[10]  Lucas Layman,et al.  LACE2: Better Privacy-Preserving Data Sharing for Cross Project Defect Prediction , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[11]  Tim Menzies,et al.  Heterogeneous Defect Prediction , 2018, IEEE Trans. Software Eng..

[12]  Yutao Ma,et al.  Towards Cross-Project Defect Prediction with Imbalanced Feature Sets , 2014, ArXiv.

[13]  Burak Turhan,et al.  Implications of ceiling effects in defect predictors , 2008, PROMISE '08.

[14]  Audris Mockus,et al.  Towards building a universal defect prediction model , 2014, MSR 2014.

[15]  Xiao Liu,et al.  An empirical study on software defect prediction with a simplified metric set , 2014, Inf. Softw. Technol..

[16]  Ayse Basar Bener,et al.  On the relative value of cross-company and within-company data for defect prediction , 2009, Empirical Software Engineering.

[17]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[18]  LiGuo Huang,et al.  Text mining in supporting software systems risk assurance , 2010, ASE '10.

[19]  Tracy Hall,et al.  DConfusion: a technique to allow cross study performance evaluation of fault prediction studies , 2013, Automated Software Engineering.

[20]  Thomas R. Ioerger,et al.  Enhancing Learning using Feature and Example selection , 2003 .

[21]  Tim Menzies,et al.  Balancing Privacy and Utility in Cross-Company Defect Prediction , 2013, IEEE Transactions on Software Engineering.

[22]  Koichiro Ochimizu,et al.  Towards logistic regression models for predicting fault-prone code across software projects , 2009, ESEM 2009.

[23]  Tim Menzies,et al.  Learning from Open-Source Projects: An Empirical Study on Defect Prediction , 2013, 2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement.

[24]  Kalyanmoy Deb,et al.  A fast and elitist multiobjective genetic algorithm: NSGA-II , 2002, IEEE Trans. Evol. Comput..

[25]  Latanya Sweeney,et al.  Achieving k-Anonymity Privacy Protection Using Generalization and Suppression , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[26]  Premkumar T. Devanbu,et al.  Recalling the "imprecision" of cross-project defect prediction , 2012, SIGSOFT FSE.

[27]  U. Fayyad Knowledge Discovery and Data Mining: An Overview , 1995 .

[28]  P. Groenen,et al.  Modern Multidimensional Scaling: Theory and Applications , 1999 .

[29]  A. Vargha,et al.  A Critique and Improvement of the CL Common Language Effect Size Statistics of McGraw and Wong , 2000 .

[30]  Michele Lanza,et al.  An extensive comparison of bug prediction approaches , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[31]  Lech Madeyski,et al.  Towards identifying software project clusters with regard to defect prediction , 2010, PROMISE '10.

[32]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  David J. Hand,et al.  Measuring classifier performance: a coherent alternative to the area under the ROC curve , 2009, Machine Learning.

[34]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[35]  D. Hosmer,et al.  Goodness of fit tests for the multiple logistic regression model , 1980 .

[36]  Jongmoon Baik,et al.  A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction , 2015, Journal of Computer Science and Technology.

[37]  Lech Madeyski,et al.  Cross-Project Defect Prediction With Respect To Code Ownership Model: An Empirical Study , 2015, e Informatica Softw. Eng. J..

[38]  Niclas Ohlsson,et al.  Predicting Fault-Prone Software Modules in Telephone Switches , 1996, IEEE Trans. Software Eng..

[39]  Taghi M. Khoshgoftaar,et al.  Software quality analysis by combining multiple projects and learners , 2008, Software Quality Journal.

[40]  Haruhiko Kaiya,et al.  Adapting a fault prediction model to allow inter languagereuse , 2008, PROMISE '08.

[41]  Hüsnü Yenigün,et al.  Testing Software and Systems , 2015, Lecture Notes in Computer Science.

[42]  Tim Menzies,et al.  Active learning and effort estimation: Finding the essential content of software effort estimation data , 2013, IEEE Transactions on Software Engineering.

[43]  Subhash Sharma Applied multivariate techniques , 1995 .

[44]  Zhaowei Shang,et al.  Negative samples reduction in cross-company software defects prediction , 2015, Inf. Softw. Technol..

[45]  S. Shapiro,et al.  An analysis of variance test for normality ( complete samp 1 es ) t , 2007 .

[46]  H. B. Mann,et al.  On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other , 1947 .

[47]  E. Schikuta GRID-CLUSTERING: A FAST HIERARCHICAL CLUSTERING METHOD FOR VERY LARGE DATA SETS , 1993 .

[48]  P. Mahalanobis On the generalized distance in statistics , 1936 .

[49]  Audris Mockus Amassing and indexing a large sample of version control systems: Towards the census of public source code history , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[50]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[51]  A. Zeller,et al.  Predicting Defects for Eclipse , 2007, Third International Workshop on Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007).

[52]  Audris Mockus,et al.  Towards building a universal defect prediction model with rank transformed predictors , 2016, Empirical Software Engineering.

[53]  Gerardo Canfora,et al.  Defect prediction as a multiobjective optimization problem , 2015, Softw. Test. Verification Reliab..

[54]  Akito Monden,et al.  An Ensemble Approach of Simple Regression Models to Cross-Project Fault Prediction , 2012, 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing.

[55]  Taghi M. Khoshgoftaar,et al.  Choosing software metrics for defect prediction: an investigation on feature selection techniques , 2011, Softw. Pract. Exp..

[56]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[57]  Jaechang Nam,et al.  CLAMI: Defect Prediction on Unlabeled Datasets , 2015, ASE 2015.

[58]  L. Penrose,et al.  THE CORRELATION BETWEEN RELATIVES ON THE SUPPOSITION OF MENDELIAN INHERITANCE , 2022 .

[59]  Andreas Zeller,et al.  Predicting defects using change genealogies , 2013, 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE).

[60]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[61]  Baowen Xu,et al.  Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning , 2015, ESEC/SIGSOFT FSE.

[62]  Taghi M. Khoshgoftaar,et al.  Unsupervised learning for expert-based software quality estimation , 2004, Eighth IEEE International Symposium on High Assurance Systems Engineering, 2004. Proceedings..

[63]  Qiang Yang,et al.  Boosting for transfer learning , 2007, ICML '07.

[64]  N. Cliff Dominance statistics: Ordinal analyses to answer ordinal questions. , 1993 .

[65]  Burak Turhan,et al.  On the dataset shift problem in software engineering prediction models , 2011, Empirical Software Engineering.

[66]  Julian Davies,et al.  Relative value , 2020, Nature.

[67]  Ivor W. Tsang,et al.  Domain Adaptation via Transfer Component Analysis , 2009, IEEE Transactions on Neural Networks.

[68]  Radu Marinescu,et al.  Detection strategies: metrics-based rules for detecting design flaws , 2004, 20th IEEE International Conference on Software Maintenance, 2004. Proceedings..

[69]  Tracy Hall,et al.  A Systematic Literature Review on Fault Prediction Performance in Software Engineering , 2012, IEEE Transactions on Software Engineering.

[70]  Ayse Basar Bener,et al.  Empirical Evaluation of Mixed-Project Defect Prediction Models , 2011, 2011 37th EUROMICRO Conference on Software Engineering and Advanced Applications.

[71]  F. Mosteller,et al.  Understanding robust and exploratory data analysis , 1985 .

[72]  Ye Yang,et al.  An investigation on the feasibility of cross-project defect prediction , 2012, Automated Software Engineering.

[73]  Qing Sun,et al.  Software defect prediction via transfer learning based neural network , 2015, 2015 First International Conference on Reliability Systems Engineering (ICRSE).

[74]  Hisashi Kashima,et al.  Unsupervised Change Analysis Using Supervised Learning , 2008, PAKDD.

[75]  Eric Eaton,et al.  Selective Transfer Between Learning Tasks Using Task-Based Boosting , 2011, AAAI.

[76]  Gerardo Canfora,et al.  Multi-objective Cross-Project Defect Prediction , 2013, 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation.

[77]  Naoyasu Ubayashi,et al.  An empirical study of just-in-time defect prediction using cross-project models , 2014, MSR 2014.

[78]  Karl Pearson F.R.S. LIII. On lines and planes of closest fit to systems of points in space , 1901 .

[79]  F. Massey The Kolmogorov-Smirnov Test for Goodness of Fit , 1951 .

[80]  Sousuke Amasaki,et al.  Improving Relevancy Filter Methods for Cross-Project Defect Prediction , 2015, 2015 3rd International Conference on Applied Computing and Information Technology/2nd International Conference on Computational Science and Intelligence.

[81]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[82]  Tim Menzies,et al.  Local vs. global models for effort estimation and defect prediction , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[83]  Jongmoon Baik,et al.  Value-cognitive boosting with a support vector machine for cross-project defect prediction , 2014, Empirical Software Engineering.

[84]  Naoyasu Ubayashi,et al.  Studying just-in-time defect prediction using cross-project models , 2015, Empirical Software Engineering.

[85]  Andreas Zeller,et al.  Mining metrics to predict component failures , 2006, ICSE.

[86]  Ayse Basar Bener,et al.  Empirical evaluation of the effects of mixed project data on learning defect predictors , 2013, Inf. Softw. Technol..

[87]  Taghi M. Khoshgoftaar,et al.  Evolutionary Optimization of Software Quality Modeling with Multiple Repositories , 2010, IEEE Transactions on Software Engineering.

[88]  Tim Menzies,et al.  Privacy and utility for defect prediction: Experiments with MORPH , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[89]  Guangchun Luo,et al.  Transfer learning for cross-company software defect prediction , 2012, Inf. Softw. Technol..

[90]  David Lo,et al.  An Empirical Study of Classifier Combination for Cross-Project Defect Prediction , 2015, 2015 IEEE 39th Annual Computer Software and Applications Conference.

[91]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[92]  Sinno Jialin Pan,et al.  Transfer defect learning , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[93]  Banu Diri,et al.  Clustering and Metrics Thresholds Based Software Fault Prediction of Unlabeled Program Modules , 2009, 2009 Sixth International Conference on Information Technology: New Generations.

[94]  Harald C. Gall,et al.  Cross-project defect prediction: a large scale experiment on data vs. domain vs. process , 2009, ESEC/SIGSOFT FSE.

[95]  Jongmoon Baik,et al.  A transfer cost-sensitive boosting approach for cross-project defect prediction , 2017, Software Quality Journal.

[96]  Tsutomu Ishida,et al.  Metrics and Models in Software Quality Engineering , 1995 .

[97]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[98]  S. Shapiro,et al.  An Analysis of Variance Test for Normality (Complete Samples) , 1965 .

[99]  Osamu Mizuno,et al.  A Cross-Project Evaluation of Text-Based Fault-Prone Module Prediction , 2014, 2014 6th International Workshop on Empirical Software Engineering in Practice.

[100]  Shari Lawrence Pfleeger,et al.  Software Metrics : A Rigorous and Practical Approach , 1998 .