Isolation Forest Filter to Simplify Training Data for Cross-Project Defect Prediction

Cross-project defect prediction (CPDP) is an active research area. When historical data is limited or a new project is being developed, CPDP models are very useful: they help software testers identify defect-prone entities and help software managers focus manpower, budget, and time on the "important" parts. However, the dissimilarity between the data distributions of the source projects and the target project degrades the performance of CPDP models, so simplifying the cross-project training data is an important problem. To address this issue, we propose an isolation forest (iForest) filter. Using 15 versions of different Java projects from the open PROMISE Data Repository and five typical predictors (naïve Bayes (NB), decision tree (DT), logistic regression (LR), k-nearest neighbor (k-NN), and random forest (RF)), we build 1050 (15 × 14 × 5) software defect prediction models (SDPMs). We also compare our models with Burak filter models and Peter filter models. The results on five performance measures (AUC, balance, G-measure, G-mean, and F1-measure) show that the iForest filter is feasible and can even outperform the other two filters. Therefore, the iForest filter can simplify cross-project training data and support building efficient SDPMs.
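The abstract does not spell out the filtering mechanics, but one plausible reading of an iForest-based relevancy filter is: fit an isolation forest on the target project's metric data and keep only the source-project instances it scores as inliers, i.e., instances whose distribution resembles the target. The sketch below illustrates this reading with scikit-learn; the function name, the `contamination` setting, and the synthetic data are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch of an iForest relevancy filter for CPDP training data.
# Assumption: we model the TARGET project's feature distribution with an
# isolation forest and keep only SOURCE instances it predicts as inliers.
import numpy as np
from sklearn.ensemble import IsolationForest


def iforest_filter(source_X, source_y, target_X, contamination=0.1, seed=0):
    """Keep source instances that look like inliers w.r.t. the target data."""
    forest = IsolationForest(contamination=contamination, random_state=seed)
    forest.fit(target_X)                   # model the target distribution
    keep = forest.predict(source_X) == 1   # +1 = inlier, -1 = outlier
    return source_X[keep], source_y[keep]


# Illustrative synthetic data: 150 source instances resembling the target,
# plus 50 drawn from a clearly different distribution.
rng = np.random.default_rng(42)
target_X = rng.normal(0.0, 1.0, size=(200, 5))
source_X = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 5)),   # similar to target
    rng.normal(8.0, 1.0, size=(50, 5)),    # dissimilar to target
])
source_y = rng.integers(0, 2, size=200)

filtered_X, filtered_y = iforest_filter(source_X, source_y, target_X)
```

After filtering, the retained subset is used as the simplified training set for whichever predictor (NB, DT, LR, k-NN, or RF) is being built for the target project.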
