A comparative study of iterative and non-iterative feature selection techniques for software defect prediction

Two important problems which can affect the performance of classification models are high-dimensionality (an overabundance of independent features in the dataset) and imbalanced data (a skewed class distribution which creates at least one class with many fewer instances than other classes). To resolve these problems concurrently, we propose an iterative feature selection approach, which repeated applies data sampling (in order to address class imbalance) followed by feature selection (in order to address high-dimensionality), and finally we perform an aggregation step which combines the ranked feature lists from the separate iterations of sampling. This approach is designed to find a ranked feature list which is particularly effective on the more balanced dataset resulting from sampling while minimizing the risk of losing data through the sampling step and missing important features. To demonstrate this technique, we employ 18 different feature selection algorithms and Random Undersampling with two post-sampling class distributions. We also investigate the use of sampling and feature selection without the iterative step (e.g., using the ranked list from a single iteration, rather than combining the lists from multiple iterations), and compare these results from the version which uses iteration. Our study is carried out using three groups of datasets with different levels of class balance, all of which were collected from a real-world software system. All of our experiments use four different learners and one feature subset size. We find that our proposed iterative feature selection approach outperforms the non-iterative approach.

[1]  Elena Marchiori,et al.  Feature selection in proteomic pattern data with support vector machines , 2004, 2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology.

[2]  Stan Matwin,et al.  STochFS: A Framework for Combining Feature Selection Outcomes Through a Stochastic Process , 2005, PKDD.

[3]  Thomas J. Ostrand,et al.  \{PROMISE\} Repository of empirical software engineering data , 2007 .

[4]  Barry W. Boehm,et al.  Finding the right data for software cost modeling , 2005, IEEE Software.

[5]  Huan Liu,et al.  Feature Selection: An Ever Evolving Frontier in Data Mining , 2010, FSDM.

[6]  Claes Wohlin,et al.  Experimentation in Software Engineering , 2012, Springer Berlin Heidelberg.

[7]  Taghi M. Khoshgoftaar,et al.  Exploring an iterative feature selection technique for highly imbalanced data sets , 2012, 2012 IEEE 13th International Conference on Information Reuse & Integration (IRI).

[8]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[9]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[10]  Taghi M. Khoshgoftaar,et al.  A COMPARATIVE STUDY OF FILTER-BASED AND WRAPPER-BASED FEATURE RANKING TECHNIQUES FOR SOFTWARE QUALITY MODELING , 2011 .

[11]  Ian Witten,et al.  Data Mining , 2000 .

[12]  Yue Jiang,et al.  Variance Analysis in Software Fault Prediction Models , 2009, 2009 20th International Symposium on Software Reliability Engineering.

[13]  Claes Wohlin,et al.  Experimentation in software engineering: an introduction , 2000 .

[14]  Qinbao Song,et al.  A General Software Defect-Proneness Prediction Framework , 2011, IEEE Transactions on Software Engineering.

[15]  Debahuti Mishra,et al.  Feature Selection for Cancer Classification: A Signal-to-noise Ratio Approach , 2011 .

[16]  Simon Haykin,et al.  Neural Networks: A Comprehensive Foundation , 1998 .

[17]  Larry A. Rendell,et al.  A Practical Approach to Feature Selection , 1992, ML.

[18]  Ian B. Jeffery,et al.  Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data , 2006, BMC Bioinformatics.

[19]  R. Browne,et al.  A comparative. , 1950, The British journal of ophthalmology.

[20]  R. Wallace Is this a practical approach? , 2001, Journal of the American College of Surgeons.

[21]  Giandomenico Spezzano,et al.  An Adaptive Distributed Ensemble Approach to Mine Concept-Drifting Data Streams , 2007 .

[22]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[23]  Liang Goh,et al.  A Hybrid Feature Selection Approach for Microarray Gene Expression Data , 2006, International Conference on Computational Science.

[24]  Jesús S. Aguilar-Ruiz,et al.  Detecting Fault Modules Applying Feature Selection to Classifiers , 2007, 2007 IEEE International Conference on Information Reuse and Integration.

[25]  A. Gelman Variance, Analysis of , 2010 .

[26]  Bart Baesens,et al.  Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings , 2008, IEEE Transactions on Software Engineering.

[27]  Abhijit S. Pandya,et al.  The Impact of Gene Selection on Imbalanced Microarray Expression Data , 2009, BICoB.

[28]  Adam A. Porter,et al.  Experimental Software Engineering: A Report on the State of the Art , 1995, 1995 17th International Conference on Software Engineering.

[29]  Taghi M. Khoshgoftaar,et al.  An Empirical Study of Learning from Imbalanced Data Using Random Forest , 2007, 19th IEEE International Conference on Tools with Artificial Intelligence(ICTAI 2007).

[30]  Taghi M. Khoshgoftaar,et al.  RUSBoost: A Hybrid Approach to Alleviating Class Imbalance , 2010, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[31]  A. Zeller,et al.  Predicting Defects for Eclipse , 2007, Third International Workshop on Predictor Models in Software Engineering (PROMISE'07: ICSE Workshops 2007).

[32]  Taghi M. Khoshgoftaar,et al.  Predicting high-risk program modules by selecting the right software measurements , 2011, Software Quality Journal.

[33]  Taghi M. Khoshgoftaar,et al.  Software Engineering with Computational Intelligence and Machine Learning A Novel Software Metric Selection Technique Using the Area Under ROC Curves , 2010, SEKE.

[34]  Tian-Yu Liu,et al.  EasyEnsemble and Feature Selection for Imbalance Data Sets , 2009, 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing.

[35]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[36]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.