The Impact of Feature Selection on Defect Prediction Performance: An Empirical Comparison

Software defect prediction aims to determine whether a software module is defect-prone by constructing prediction models. The performance of such models is susceptible to the high dimensionality of the datasets that may include irrelevant and redundant features. Feature selection is applied to alleviate this issue. Because many feature selection methods have been proposed, there is an imperative need to analyze and compare these methods. Prior empirical studies may have potential controversies and limitations, such as the contradictory results, usage of private datasets and inappropriate statistical test techniques. This observation leads us to conduct a careful empirical study to reinforce the confidence of the experimental conclusions by considering several potential source of bias, such as the noise in the dataset and the dataset types. In this paper, we investigate the impact of 32 feature selection methods on the defect prediction performance over two versions of the NASA dataset (i.e., the noisy and clean NASA datasets) and one open source AEEEM dataset. We use a state-of-the-art double Scott-Knott test technique to analyze these methods. Experimental results show that the effectiveness of these feature selection methods on defect prediction performance varies significantly over all the datasets.

[1]  Taghi M. Khoshgoftaar,et al.  Choosing software metrics for defect prediction: an investigation on feature selection techniques , 2011, Softw. Pract. Exp..

[2]  Bojan Cukic,et al.  Software defect prediction using semi-supervised learning with dimension reduction , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[3]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.

[4]  Simon Fong,et al.  Improving classification accuracy using Fuzzy Clustering Coefficients of Variations (FCCV) feature selection algorithm , 2014, 2014 IEEE 15th International Symposium on Computational Intelligence and Informatics (CINTI).

[5]  Mark A. Hall,et al.  Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning , 1999, ICML.

[6]  Taghi M. Khoshgoftaar,et al.  A Study on First Order Statistics-Based Feature Selection Techniques on Software Metric Data , 2013, International Conference on Software Engineering and Knowledge Engineering.

[7]  Maurice H. Halstead,et al.  Elements of software science , 1977 .

[8]  Tom Fawcett,et al.  Robust Classification for Imprecise Environments , 2000, Machine Learning.

[9]  Xiaoming Xu,et al.  A hybrid genetic algorithm for feature selection wrapper based on mutual information , 2007, Pattern Recognit. Lett..

[10]  Simon Fong,et al.  A novel feature selection by clustering coefficients of variations , 2014, Ninth International Conference on Digital Information Management (ICDIM 2014).

[11]  Xue-wen Chen,et al.  Combating the Small Sample Class Imbalance Problem Using Feature Selection , 2010, IEEE Transactions on Knowledge and Data Engineering.

[12]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[13]  Giandomenico Spezzano,et al.  An Adaptive Distributed Ensemble Approach to Mine Concept-Drifting Data Streams , 2007 .

[14]  Xiang Chen,et al.  FECAR: A Feature Selection Framework for Software Defect Prediction , 2014, 2014 IEEE 38th Annual Computer Software and Applications Conference.

[15]  Anas N. Al-Rabadi,et al.  A comparison of modified reconstructability analysis and Ashenhurst‐Curtis decomposition of Boolean functions , 2004 .

[16]  Xiang Chen,et al.  A Two-Stage Data Preprocessing Approach for Software Fault Prediction , 2014, 2014 Eighth International Conference on Software Security and Reliability.

[17]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[18]  Bojan Cukic,et al.  Defect Prediction between Software Versions with Active Learning and Dimensionality Reduction , 2014, 2014 IEEE 25th International Symposium on Software Reliability Engineering.

[19]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[20]  David Lo,et al.  ELBlocker: Predicting blocking bugs with ensemble imbalance learning , 2015, Inf. Softw. Technol..

[21]  Michele Lanza,et al.  An extensive comparison of bug prediction approaches , 2010, 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010).

[22]  Bart Baesens,et al.  Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings , 2008, IEEE Transactions on Software Engineering.

[23]  M. Thangaraj Performance Study on Rule-based Classification Techniques across Multiple Database Relations , 2013 .

[24]  Lipika Dey,et al.  A feature selection technique for classificatory analysis , 2005, Pattern Recognit. Lett..

[25]  N. Ramaraj,et al.  A novel hybrid feature selection via Symmetrical Uncertainty ranking based local memetic search algorithm , 2010, Knowl. Based Syst..

[26]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[27]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[28]  Tim Menzies,et al.  Heterogeneous Defect Prediction , 2015, IEEE Transactions on Software Engineering.

[29]  Lalita Bhanu Murthy Neti,et al.  Impact of Feature Selection Techniques on Bug Prediction Models , 2015, ISEC.

[30]  William W. Cohen Fast Effective Rule Induction , 1995, ICML.

[31]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[32]  Sunghun Kim,et al.  Reducing Features to Improve Code Change-Based Bug Prediction , 2013, IEEE Transactions on Software Engineering.

[33]  J. Kinney,et al.  Equitability, mutual information, and the maximal information coefficient , 2013, Proceedings of the National Academy of Sciences.

[34]  Shane McIntosh,et al.  Automated Parameter Optimization of Classification Techniques for Defect Prediction Models , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[35]  Michael Mitzenmacher,et al.  Detecting Novel Associations in Large Data Sets , 2011, Science.

[36]  Huan Liu,et al.  Consistency-based search in feature selection , 2003, Artif. Intell..

[37]  Jiawei Han,et al.  Generalized Fisher Score for Feature Selection , 2011, UAI.

[38]  Gregory W. Corder,et al.  Nonparametric Statistics : A Step-by-Step Approach , 2014 .

[39]  Lefteris Angelis,et al.  Ranking and Clustering Software Cost Estimation Models through a Multiple Comparisons Algorithm , 2013, IEEE Transactions on Software Engineering.

[40]  Martin Shepperd,et al.  Data Sets and Data Quality in Software Engineering: Eight Years On , 2016, PROMISE.

[41]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[42]  Ron Kohavi,et al.  Feature Selection for Knowledge Discovery and Data Mining , 1998 .

[43]  Emad Shihab,et al.  Characterizing and predicting blocking bugs in open source projects , 2014, MSR 2014.

[44]  Dietmar Cordes,et al.  Hierarchical clustering to measure connectivity in fMRI resting-state data. , 2002, Magnetic resonance imaging.

[45]  Qinbao Song,et al.  A General Software Defect-Proneness Prediction Framework , 2011, IEEE Transactions on Software Engineering.

[46]  Jin Liu,et al.  MICHAC: Defect Prediction via Feature Selection Based on Maximal Information Coefficient with Hierarchical Agglomerative Clustering , 2016, 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[47]  Xiang Chen,et al.  Empirical Studies of a Two-Stage Data Preprocessing Approach for Software Fault Prediction , 2014, IEEE Transactions on Reliability.

[48]  Shane McIntosh,et al.  Revisiting the Impact of Classification Techniques on the Performance of Defect Prediction Models , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[49]  Karim O. Elish,et al.  Predicting defect-prone software modules using support vector machines , 2008, J. Syst. Softw..

[50]  Ayse Basar Bener,et al.  Reducing false alarms in software defect prediction by decision threshold optimization , 2009, 2009 3rd International Symposium on Empirical Software Engineering and Measurement.

[51]  Qinbao Song,et al.  Data Quality: Some Comments on the NASA Software Defect Datasets , 2013, IEEE Transactions on Software Engineering.

[52]  Ingunn Myrtveit,et al.  Reliability and validity in comparative studies of software prediction models , 2005, IEEE Transactions on Software Engineering.

[53]  T. Chai,et al.  Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature , 2014 .

[54]  Shane McIntosh,et al.  Mining Co-change Information to Understand When Build Changes Are Necessary , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[55]  Martin J. Shepperd,et al.  Comparing Software Prediction Techniques Using Simulation , 2001, IEEE Trans. Software Eng..

[56]  Jin Liu,et al.  Dictionary learning based software defect prediction , 2014, ICSE.

[57]  Taghi M. Khoshgoftaar,et al.  Metric Selection for Software Defect Prediction , 2011, Int. J. Softw. Eng. Knowl. Eng..

[58]  Ayse Basar Bener,et al.  Defect prediction from static code features: current results, limitations, new approaches , 2010, Automated Software Engineering.

[59]  Robert C. Holte,et al.  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets , 1993, Machine Learning.

[60]  Kim Herzig,et al.  Using Pre-Release Test Failures to Build Early Post-Release Defect Prediction Models , 2014, 2014 IEEE 25th International Symposium on Software Reliability Engineering.

[61]  Huan Liu,et al.  Consistency Based Feature Selection , 2000, PAKDD.

[62]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[63]  David Lo,et al.  Cross-project build co-change prediction , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[64]  Taghi M. Khoshgoftaar,et al.  An Empirical Study of Learning from Imbalanced Data Using Random Forest , 2007, 19th IEEE International Conference on Tools with Artificial Intelligence(ICTAI 2007).

[65]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[66]  Pat Langley,et al.  Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.

[67]  A. Scott,et al.  A Cluster Analysis Method for Grouping Means in the Analysis of Variance , 1974 .

[68]  Ahmed E. Hassan,et al.  Prioritizing the devices to test your app on: a case study of Android game apps , 2014, SIGSOFT FSE.

[69]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[70]  Alan Agresti,et al.  Building and Applying Logistic Regression Models , 2003 .

[71]  Tracy Hall,et al.  A Systematic Literature Review on Fault Prediction Performance in Software Engineering , 2012, IEEE Transactions on Software Engineering.