Feature Selection with Imbalanced Data for Software Defect Prediction

In this paper, we study the impact of data sampling followed by attribute selection on classification models built from binary-class imbalanced data in the context of software quality engineering. We use a wrapper-based attribute ranking technique to select a subset of attributes, and random undersampling (RUS) of the majority class to alleviate the negative effects of imbalanced data on the prediction models. The datasets used in the empirical study were collected from numerous software projects. Five data preprocessing scenarios were explored in the experiments: (1) training on the original, unaltered fit dataset; (2) training on a sampled version of the fit dataset; (3) training on the unsampled fit dataset using only the attributes selected by feature selection performed on the unsampled fit dataset; (4) training on the unsampled fit dataset using only the attributes selected by feature selection performed on a sampled version of the fit dataset; and (5) training on a sampled version of the fit dataset using only the attributes selected by feature selection performed on that sampled version. We compared the performance of the classification models built under these five scenarios. The results demonstrate that the models built on the sampled fit data, with or without feature selection (scenarios 2 and 5), significantly outperformed the models built on the unsampled fit data (scenarios 1, 3, and 4). Moreover, the two scenarios using sampled data (2 and 5) performed very similarly, even though scenario 5 uses only about 15% to 30% of the complete attribute set used in scenario 2.
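To make the five preprocessing scenarios concrete, the sketch below shows one way to realize them, assuming scikit-learn and imbalanced-learn. The classifier (GaussianNB), the ranking metric (AUC), and the top-k cutoff are illustrative assumptions, not the paper's exact experimental setup; the wrapper-based ranking here scores each attribute by the cross-validated performance of a classifier trained on that attribute alone.

```python
# Minimal sketch of the five preprocessing scenarios; hypothetical choices of
# classifier, metric, and cutoff -- not the paper's exact configuration.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import RandomUnderSampler

def rank_attributes(X, y, clf=None, cv=5):
    """Wrapper-based attribute ranking: score each attribute by the
    cross-validated AUC of a classifier trained on it alone."""
    clf = clf or GaussianNB()
    scores = [
        cross_val_score(clf, X[:, [j]], y, cv=cv, scoring="roc_auc").mean()
        for j in range(X.shape[1])
    ]
    return np.argsort(scores)[::-1]  # best-ranked attribute first

def preprocess(X, y, scenario, k=10):
    """Return the fit data for one of the five scenarios; k is the number
    of top-ranked attributes to keep when feature selection is applied."""
    rus = RandomUnderSampler(random_state=0)
    Xs, ys = rus.fit_resample(X, y)   # RUS on the majority class
    if scenario == 1:                 # original, unaltered fit data
        return X, y
    if scenario == 2:                 # sampled fit data, all attributes
        return Xs, ys
    if scenario == 3:                 # unsampled data, ranking on unsampled data
        top = rank_attributes(X, y)[:k]
        return X[:, top], y
    if scenario == 4:                 # unsampled data, ranking on sampled data
        top = rank_attributes(Xs, ys)[:k]
        return X[:, top], y
    if scenario == 5:                 # sampled data, ranking on sampled data
        top = rank_attributes(Xs, ys)[:k]
        return Xs[:, top], ys
    raise ValueError("scenario must be 1-5")
```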
