Impact of Feature Selection Methods on the Predictive Performance of Software Defect Prediction Models: An Extensive Empirical Study

Feature selection (FS) is a feasible solution for mitigating the high-dimensionality problem, and many FS methods have been proposed in the context of software defect prediction (SDP). However, empirical studies on the impact and effectiveness of FS methods on SDP models often produce contradictory experimental results and inconsistent findings. These contradictions can be attributed to study limitations such as small datasets, limited FS search methods, and unsuitable prediction models within the scope of each study. It is therefore critical to conduct an extensive empirical study that addresses these contradictions, both to guide researchers and to strengthen the scientific rigor of experimental conclusions. In this study, we investigated the impact of 46 FS methods using Naive Bayes and Decision Tree classifiers over 25 software defect datasets from 4 software repositories (NASA, PROMISE, ReLink, and AEEEM). The resulting prediction models were evaluated based on accuracy and AUC values. The Scott–KnottESD and the novel Double Scott–KnottESD rank statistical methods were used to rank the studied FS methods. The experimental results showed that there is no single best FS method, as the performance of each method depends on the choice of classifier, performance evaluation metric, and dataset. However, we recommend the use of statistical-based, probability-based, and classifier-based filter feature ranking (FFR) methods in SDP. For filter subset selection (FSS) methods, correlation-based feature selection (CFS) with metaheuristic search methods is recommended. For wrapper feature selection (WFS) methods, the IWSS-based WFS method is recommended, as it outperforms the conventional SFS- and LHS-based WFS methods.
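To make the idea of a filter feature ranking (FFR) method and an AUC-based evaluation concrete, the following is a minimal sketch and not the study's actual pipeline. It assumes absolute Pearson correlation with the defect label as a stand-in for a statistical-based ranking criterion, and it computes AUC as the fraction of (defective, clean) module pairs ordered correctly. The function names, the toy metrics (`loc`, `complexity`, `noise`), and the data are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / sqrt(vx * vy)


def rank_features(rows, labels):
    """Return feature indices sorted by |correlation with the label|, best first."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def auc(scores, labels):
    """AUC: share of (defective, clean) pairs where the defective module
    receives the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: columns are [loc, complexity, noise]; label 1 = defective.
rows = [[10, 1, 5], [20, 2, 3], [30, 3, 5], [40, 8, 3], [50, 9, 5], [60, 10, 3]]
labels = [0, 0, 0, 1, 1, 1]

ranking = rank_features(rows, labels)  # feature indices, most relevant first
top_k = ranking[:2]                    # keep the two highest-ranked features
print("ranking:", ranking, "selected:", top_k)
print("AUC using the top feature as a score:", auc([r[ranking[0]] for r in rows], labels))
```

In a filter approach like this, the ranking is computed once from the data and is independent of the classifier, which is what distinguishes FFR from wrapper methods that score feature subsets by training a classifier on each candidate subset.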
