A two-phase transfer learning model for cross-project defect prediction

Abstract

Context: Previous studies have shown that TCA+, a transfer learning model proposed by Nam et al., can significantly improve the performance of cross-project defect prediction (CPDP). TCA+ achieves this improvement by reducing the difference in data distribution between the source project (training data) and the target project (testing data). However, TCA+ is unstable: its performance varies greatly depending on which source project is used to build the prediction model, and in practice it is hard to choose a suitable source project.

Objective: To address this limitation of TCA+, we propose a two-phase transfer learning model (TPTL) for CPDP.

Method: In the first phase, we propose a source project estimator (SPE) that automatically ranks candidate source projects by their distribution similarity to the target project and selects the two projects estimated to achieve the highest F1-score and the highest cost-effectiveness, respectively. In the second phase, we leverage TCA+ to build two prediction models based on the two selected projects and combine their prediction results to further improve prediction performance.

Results: We evaluate TPTL on 42 defect datasets from the PROMISE repository and compare it with two versions of TCA+ (TCA+_Rnd, which randomly selects one source project, and TCA+_All, which uses all candidate source projects), the related source project selection model TDS proposed by Herbold, a state-of-the-art CPDP model leveraging a log transformation (LT), and the transfer learning model Dycom. Experimental results show that, on average across the 42 datasets, TPTL improves these five baselines by 19%, 5%, 36%, 27%, and 11% in terms of F1-score, and by 64%, 92%, 71%, 11%, and 66% in terms of cost-effectiveness, respectively.

Conclusion: The proposed TPTL model solves the instability problem of TCA+ and shows substantial improvements over state-of-the-art and related CPDP models.
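To make the two-phase pipeline concrete, here is a minimal sketch in Python. It is illustrative only: a Kolmogorov-Smirnov-based score stands in for SPE's distribution-similarity estimator, plain logistic regression stands in for TCA+, and the function names distribution_similarity and tptl_predict are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp                      # stand-in similarity measure
from sklearn.linear_model import LogisticRegression   # stand-in for TCA+

def distribution_similarity(source_X, target_X):
    """Score how similar a source project's metric distributions are to the
    target's: the mean per-metric Kolmogorov-Smirnov statistic, negated so
    that higher means more similar. SPE defines its own similarity estimate;
    this is only an illustrative proxy."""
    stats = [ks_2samp(source_X[:, j], target_X[:, j]).statistic
             for j in range(target_X.shape[1])]
    return -float(np.mean(stats))

def tptl_predict(candidate_sources, target_X):
    """Sketch of TPTL's two phases.

    Phase 1: rank candidate source projects, given as (X, y) pairs, by
    distribution similarity to the target and keep the top two (the paper's
    SPE picks the projects estimated to maximize F1-score and
    cost-effectiveness, respectively).
    Phase 2: train one model per selected source (TCA+ in the paper) and
    combine the two models' predicted defect probabilities."""
    ranked = sorted(candidate_sources,
                    key=lambda src: distribution_similarity(src[0], target_X),
                    reverse=True)
    probabilities = []
    for X, y in ranked[:2]:
        model = LogisticRegression(max_iter=1000).fit(X, y)
        probabilities.append(model.predict_proba(target_X)[:, 1])
    # Combine the two prediction results; simple averaging is assumed here.
    return np.mean(probabilities, axis=0) >= 0.5
```

Averaging the two probability vectors is only one way to "combine their prediction results"; the paper defines its own combination scheme, and TCA+ additionally maps source and target data into a shared feature space before training, which this sketch omits.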

[1] Elaine J. Weyuker, et al. Predicting the location and number of faults in large software systems, 2005, IEEE Transactions on Software Engineering.

[2] Taghi M. Khoshgoftaar, et al. Evolutionary Optimization of Software Quality Modeling with Multiple Repositories, 2010, IEEE Transactions on Software Engineering.

[3] Qiang Yang, et al. A Survey on Transfer Learning, 2010, IEEE Transactions on Knowledge and Data Engineering.

[4] S. Sathiya Keerthi, et al. Improvements to the SMO algorithm for SVM regression, 2000, IEEE Transactions on Neural Networks.

[5] Emilia Mendes, et al. How to Make Best Use of Cross-Company Data for Web Effort Estimation?, 2015, ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM).

[6] Thomas J. Ostrand, et al. PROMISE Repository of empirical software engineering data, 2007.

[7] N. Cliff. Ordinal methods for behavioral data analysis, 1996.

[8] Brian Henderson-Sellers, et al. Object-Oriented Metrics, 1995, TOOLS.

[9] Andrea De Lucia, et al. Cross-project defect prediction models: L'Union fait la force, 2014, Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE).

[10] Burak Turhan, et al. A Systematic Literature Review and Meta-Analysis on Cross Project Defect Prediction, 2019, IEEE Transactions on Software Engineering.

[11] Ye Yang, et al. An investigation on the feasibility of cross-project defect prediction, 2012, Automated Software Engineering.

[12] Tian Jiang, et al. Personalized defect prediction, 2013, 28th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[13] Ayse Basar Bener, et al. On the relative value of cross-company and within-company data for defect prediction, 2009, Empirical Software Engineering.

[14] Yi Zhang, et al. Classifying Software Changes: Clean or Buggy?, 2008, IEEE Transactions on Software Engineering.

[15] Koichiro Ochimizu, et al. Towards logistic regression models for predicting fault-prone code across software projects, 2009, ESEM.

[16] Michele Lanza, et al. An extensive comparison of bug prediction approaches, 2010, 7th IEEE Working Conference on Mining Software Repositories (MSR).

[17] Lech Madeyski, et al. Towards identifying software project clusters with regard to defect prediction, 2010, PROMISE '10.

[18] Tim Menzies, et al. Data Mining Static Code Attributes to Learn Defect Predictors, 2007, IEEE Transactions on Software Engineering.

[19] Ivor W. Tsang, et al. Domain Adaptation via Transfer Component Analysis, 2009, IEEE Transactions on Neural Networks.

[20] Banu Diri, et al. Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem, 2009, Information Sciences.

[21] D. Hosmer, et al. Applied Logistic Regression, 1991.

[22] Tim Menzies, et al. Better cross company defect prediction, 2013, 10th Working Conference on Mining Software Repositories (MSR).

[23] Nachiappan Nagappan, et al. Predicting defects using network analysis on dependency graphs, 2008, 30th ACM/IEEE International Conference on Software Engineering (ICSE).

[24] Ahmed E. Hassan, et al. Predicting faults using the complexity of code changes, 2009, 31st IEEE International Conference on Software Engineering (ICSE).

[25] Steffen Herbold, et al. Training data selection for cross-project defect prediction, 2013, PROMISE.

[26] Burak Turhan, et al. On the dataset shift problem in software engineering prediction models, 2011, Empirical Software Engineering.

[27] Ian H. Witten, et al. The WEKA data mining software: an update, 2009, SIGKDD Explorations.

[28] Lionel C. Briand, et al. Data Mining Techniques for Building Fault-proneness Models in Telecom Java Software, 2007, 18th IEEE International Symposium on Software Reliability Engineering (ISSRE).

[29] Sunghun Kim, et al. Reducing Features to Improve Bug Prediction, 2009, IEEE/ACM International Conference on Automated Software Engineering (ASE).

[30] Guangchun Luo, et al. Transfer learning for cross-company software defect prediction, 2012, Information and Software Technology.

[31] Premkumar T. Devanbu, et al. Sample size vs. bias in defect prediction, 2013, ESEC/FSE.

[32] Jens Grabowski, et al. A Comparative Study to Benchmark Cross-Project Defect Prediction Approaches, 2018, IEEE Transactions on Software Engineering.

[33] F. Wilcoxon. Individual Comparisons by Ranking Methods, 1945.

[34] Sinno Jialin Pan, et al. Transfer defect learning, 2013, 35th International Conference on Software Engineering (ICSE).

[35] Harald C. Gall, et al. Cross-project defect prediction: a large scale experiment on data vs. domain vs. process, 2009, ESEC/FSE.

[37] Premkumar T. Devanbu, et al. How, and why, process metrics are better, 2013, 35th International Conference on Software Engineering (ICSE).

[38] Niclas Ohlsson, et al. Predicting Fault-Prone Software Modules in Telephone Switches, 1996, IEEE Transactions on Software Engineering.

[39] Haruhiko Kaiya, et al. Adapting a fault prediction model to allow inter language reuse, 2008, PROMISE '08.