Assessments of Feature Selection Techniques with Respect to Data Sampling for Highly Imbalanced Software Measurement Data

In software defect prediction, a classification model is first built from software metrics and fault data collected in a past software development project; the model is then applied to a similar project, or to a new release of the same project, to classify program modules as either fault-prone (fp) or not-fault-prone (nfp). Such a model helps allocate limited financial and human resources for software testing and inspection where they are most needed. The predictive power of a classification model built from a given data set is affected by many factors. In this paper we focus on two problems that often arise in software measurement data: high dimensionality and class imbalance between the two types of modules (e.g., far more nfp modules than fp modules in a data set). Both problems lengthen learning time and degrade the predictive performance of classification models. We address them with data sampling followed by feature selection (FS). Six data sampling strategies (three sampling techniques, each combined with two post-sampling proportion ratios) and six commonly used feature ranking approaches are employed in this study. We evaluate the FS techniques in two ways: (1) the conventional method of assessing classification performance after the training data is modified, and (2) an analysis of FS stability, specifically to understand how data sampling affects the stability of FS applied to the sampled data. The experiments were performed on nine data sets from a real-world software project. The results demonstrate that the FS techniques that most improve the models' classification performance do not also exhibit the best stability, and vice versa. In addition, classification performance is affected more by the choice of sampling technique than by the post-sampling proportion, whereas the opposite holds for stability.
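The following is a minimal sketch of the sampling-then-selection pipeline the abstract describes, not the paper's exact protocol. It assumes scikit-learn and imbalanced-learn are available; the ANOVA F-statistic ranker, SMOTE, random undersampling, the 50:50 and 35:65 post-sampling ratios, and the helper names `top_k_features` and `kuncheva_index` are illustrative stand-ins rather than the study's six rankers and three samplers. Stability is approximated here as Kuncheva's consistency index between the feature subset chosen on the sampled data and the subset chosen on the original data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler


def top_k_features(X, y, k):
    """Rank features with the ANOVA F-statistic and keep the top-k indices."""
    scores, _ = f_classif(X, y)
    return set(np.argsort(scores)[::-1][:k])


def kuncheva_index(a, b, n):
    """Kuncheva's consistency index for two size-k subsets drawn from n features."""
    k, r = len(a), len(a & b)
    expected = k * k / n  # overlap expected by chance
    return (r - expected) / (k - expected)


# Synthetic stand-in for a highly imbalanced software measurement data set:
# roughly 10% fault-prone (fp) modules vs. 90% not-fault-prone (nfp).
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)

# Sampling strategies: two techniques x two post-sampling class ratios
# (ratios chosen here for illustration only).
samplers = {
    "SMOTE 50:50": SMOTE(sampling_strategy=1.0, random_state=0),
    "SMOTE 35:65": SMOTE(sampling_strategy=35 / 65, random_state=0),
    "RUS   50:50": RandomUnderSampler(sampling_strategy=1.0, random_state=0),
    "RUS   35:65": RandomUnderSampler(sampling_strategy=35 / 65, random_state=0),
}

k, n = 10, X.shape[1]
baseline = top_k_features(X, y, k)  # selection on the original, unsampled data
for name, sampler in samplers.items():
    X_s, y_s = sampler.fit_resample(X, y)   # sample first ...
    selected = top_k_features(X_s, y_s, k)  # ... then rank and select features
    stability = kuncheva_index(baseline, selected, n)
    print(f"{name}: consistency with unsampled selection = {stability:.2f}")
```

In a fuller replication one would repeat the sampling with different random seeds, average the pairwise consistency indices across runs, and also train a classifier on each sampled-and-reduced training set to compare classification performance against the stability results.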
