ALTRA: Cross-Project Software Defect Prediction via Active Learning and TrAdaBoost

Cross-project defect prediction (CPDP) methods can be used when the target project is new or lacks enough labeled program modules. In such target projects, program modules can be easily extracted and measured with software measurement tools, but labeling them is time-consuming, error-prone, and requires professional domain knowledge. Moreover, directly using labeled modules from other projects (i.e., the source projects) cannot achieve satisfactory performance in most cases because of large differences in data distribution. In this article, to the best of our knowledge, we are the first to propose ALTRA, a novel method that combines active learning and TrAdaBoost to alleviate this issue. In particular, we first use the Burak filter to select labeled modules from the source project that are similar to the unlabeled modules in the target project. Then we use active learning to choose representative unlabeled modules from the target project and ask experts to label them (i.e., as defective or non-defective). Next, we use TrAdaBoost to determine the weights of the labeled modules in the source and target projects, and construct the model via a weighted support vector machine. After a small number of modules (i.e., only 5% of the modules) in the target project have been labeled, ALTRA terminates and returns the final constructed model. To show the effectiveness of ALTRA, we choose 10 large-scale open-source projects from different application domains. In terms of both the F1 and AUC performance indicators, ALTRA performs significantly better than seven state-of-the-art CPDP baselines. Moreover, we show that the Burak filter, the uncertainty-based active learning strategy, the class imbalance learning method, and TrAdaBoost are all competitive component choices in ALTRA.
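
The sketch below illustrates the three building blocks named in the abstract (Burak filter, uncertainty-based active learning, and TrAdaBoost-style reweighting with a weighted SVM). It is a minimal illustration under our own assumptions, not the authors' implementation: the feature matrices X_src/y_src (labeled source project) and X_tgt (target project), the parameter defaults, and the function names are hypothetical.

```python
# A minimal sketch of the pipeline described in the abstract, assuming
# hypothetical NumPy feature matrices X_src/y_src (labeled source modules)
# and X_tgt (target modules). Parameter values are illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC


def burak_filter(X_src, y_src, X_tgt, k=10):
    """Keep the k nearest source modules for each target module (Burak filter)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_src)
    _, idx = nn.kneighbors(X_tgt)
    keep = np.unique(idx.ravel())
    return X_src[keep], y_src[keep]


def uncertainty_query(model, X_pool, n_query):
    """Pick the unlabeled modules whose predicted defect probability is closest to 0.5."""
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:n_query]


def tradaboost_weighted_svm(X_src, y_src, X_tgt_l, y_tgt_l, n_rounds=10):
    """TrAdaBoost-style reweighting: down-weight misclassified source modules,
    up-weight misclassified labeled target modules, then fit a weighted SVM."""
    n_s = len(X_src)
    X = np.vstack([X_src, X_tgt_l])
    y = np.concatenate([y_src, y_tgt_l])
    w = np.ones(len(X))
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))
    model = None
    for _ in range(n_rounds):
        model = SVC(kernel="linear", probability=True)
        model.fit(X, y, sample_weight=w / w.sum())
        err = (model.predict(X) != y).astype(float)
        # Weighted error is measured on the labeled target part only, as in TrAdaBoost.
        eps = np.sum(w[n_s:] * err[n_s:]) / np.sum(w[n_s:])
        eps = min(max(eps, 1e-10), 0.499)
        beta_tgt = eps / (1.0 - eps)
        w[:n_s] *= beta_src ** err[:n_s]    # shrink weights of misclassified source modules
        w[n_s:] *= beta_tgt ** (-err[n_s:]) # grow weights of misclassified target modules
    return model
```

In the full loop described in the abstract, uncertainty_query would be called repeatedly on the remaining unlabeled target modules, the newly expert-labeled modules would be appended to X_tgt_l/y_tgt_l, and the process would stop once roughly 5% of the target modules have been labeled; the class imbalance treatment mentioned in the abstract is omitted here for brevity.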
