Evaluating defect prediction approaches using a massive set of metrics: an empirical study

To evaluate the performance of a within-project defect prediction approach, researchers typically report precision, recall, and F-measure. The machine learning literature, however, offers many more evaluation metrics (e.g., the Matthews Correlation Coefficient and G-mean), each of which assesses an approach from a different perspective. In this paper, we investigate the performance of within-project defect prediction approaches across a large set of evaluation metrics. We choose six state-of-the-art approaches that are widely used in the defect prediction literature: naive Bayes, decision tree, logistic regression, kNN, random forest, and Bayesian network. We evaluate these six approaches on 14 evaluation metrics (e.g., G-mean, F-measure, balance, MCC, J-coefficient, and AUC). Our goal is to explore a practical and comprehensive way to evaluate prediction approaches. We measure the performance of the defect prediction approaches on 10 defect datasets from the PROMISE repository. The results show that the Bayesian network achieves noteworthy performance: it obtains the best recall, FN-R, G-mean1, and balance on 9 of the 10 datasets, and the best F-measure and J-coefficient on 7 of the 10 datasets.
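The sketch below illustrates how several of the metrics named above can be derived from a binary confusion matrix. It is not the authors' code: label 1 is assumed to mean "defective", the function names are illustrative, and the conventions used for G-mean1 (geometric mean of recall and precision) and balance (normalized distance of (pf, pd) from the ideal ROC point) are common ones in the defect prediction literature that may differ from the paper's exact definitions.

```python
# Minimal sketch (not the paper's implementation) of confusion-matrix-based
# evaluation metrics commonly used in defect prediction studies.
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, treating label 1 as 'defective' (assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def defect_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if tp + fn else 0.0        # probability of detection (pd)
    precision = tp / (tp + fp) if tp + fp else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0            # false positive rate
    specificity = 1.0 - pf
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # G-mean1: geometric mean of recall and precision (one common convention).
    g_mean1 = math.sqrt(recall * precision)
    # Balance: 1 minus the normalized Euclidean distance of (pf, pd) from (0, 1).
    balance = 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - recall) ** 2) / math.sqrt(2)
    # Matthews Correlation Coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # J-coefficient (Youden's J): recall + specificity - 1.
    j_coefficient = recall + specificity - 1.0
    return {"recall": recall, "precision": precision, "F-measure": f_measure,
            "G-mean1": g_mean1, "balance": balance, "MCC": mcc,
            "J-coefficient": j_coefficient}

if __name__ == "__main__":
    # Toy labels and predictions, purely for illustration.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    for name, value in defect_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

AUC is omitted from the sketch because it requires ranked prediction scores rather than hard class labels; in practice it would be computed from each classifier's predicted probabilities.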
