THE USE OF UNDER- AND OVERSAMPLING WITHIN ENSEMBLE FEATURE SELECTION AND CLASSIFICATION FOR SOFTWARE QUALITY PREDICTION

Software quality prediction models are useful tools for creating high-quality software products. In the general process, practitioners use software metrics and defect data along with various data mining techniques to build classification models that identify potentially faulty program modules, thereby enabling effective project resource allocation. The predictive accuracy of these classification models is often affected by the quality of the input data. Two main problems that can degrade input data quality are high dimensionality (too many independent attributes in a dataset) and class imbalance (many more members of one class than the other in a binary classification problem). To address both problems, we present an iterative feature selection approach that repeatedly applies data sampling (to overcome class imbalance) followed by feature selection (to overcome high dimensionality), and finally combines the ranked feature lists from the separate sampling iterations. After feature selection, models are built either with a plain learner or with a boosting algorithm that incorporates sampling. To assess the impact of the balancing, filter, and learning techniques used in the feature selection and model-building process on software quality prediction, we employ two sampling techniques, random undersampling (RUS) and the synthetic minority oversampling technique (SMOTE); two ensemble boosting approaches, RUSBoost and SMOTEBoost (in which RUS and SMOTE, respectively, are integrated into a boosting technique); and six feature ranking techniques. We apply the proposed techniques to several groups of datasets from two real-world software systems and use two learners to build classification models. The experimental results demonstrate that RUS yields better prediction than SMOTE, and that the boosting-based approaches improve classification performance compared to models built without boosting. In addition, some feature ranking techniques, such as chi-squared and information gain, exhibit better and more stable classification behavior than the other rankers.
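As a concrete illustration of the iterative approach described above, the sketch below repeatedly balances the training data by random undersampling, ranks features on each balanced sample with a chi-squared filter, and aggregates the per-iteration rankings into one final feature list. This is a minimal sketch only: the helper names, the ten-iteration default, and the mean-rank aggregation are illustrative assumptions rather than the authors' exact procedure, and scikit-learn's chi2 scorer (which assumes non-negative feature values, as is typical of software metrics) stands in for any of the six rankers studied.

import numpy as np
from sklearn.feature_selection import chi2

def random_undersample(X, y, rng):
    """Randomly drop majority-class instances until both classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        if n > n_min:  # undersample only the larger class
            idx = rng.choice(idx, size=n_min, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]

def iterative_rus_feature_ranking(X, y, n_iterations=10, seed=0):
    """Aggregate chi-squared feature ranks over repeated undersampled subsets."""
    rng = np.random.default_rng(seed)
    rank_sum = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        Xb, yb = random_undersample(X, y, rng)
        scores, _ = chi2(Xb, yb)                                 # higher score = more relevant
        ranks = np.argsort(np.argsort(-np.nan_to_num(scores)))   # rank 0 = best feature
        rank_sum += ranks
    return np.argsort(rank_sum)                                  # features ordered by mean rank

# Usage sketch (hypothetical data): keep the top-k ranked features before training.
# selected = iterative_rus_feature_ranking(X_train, y_train)[:10]
# model.fit(X_train[:, selected], y_train)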
