Improving Software Quality Estimation by Combining Boosting and Feature Selection

The predictive accuracy of a classification model is often affected by the quality of the training data. Two problems in particular can degrade training-data quality: high dimensionality (too many independent attributes in a dataset) and class imbalance (many more instances of one class than the other in a binary classification problem). In this study, we present an iterative feature selection approach that works together with an ensemble learning method to address both problems. The iterative feature selection approach samples the dataset k times and applies feature ranking to each sampled dataset; the k different rankings are then aggregated into a single feature ranking. The ensemble learning method used is RUSBoost, which integrates random undersampling (RUS) into a boosting algorithm. The main purpose of this paper is to investigate the impact of feature selection, as well as of the RUSBoost approach, on classification performance in the context of software quality prediction. In our experiments, we explore six rankers, each used along with RUS in the iterative feature selection process. Following feature selection, models are built either with a plain learner or with the RUSBoost algorithm. We also examine the case of no feature selection and use it as the baseline for comparisons. The experimental results demonstrate that, with the exception of one learner, feature selection combined with boosting provides better classification performance than when either is applied alone or when neither is applied.
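The iterative feature selection procedure described above (sample the data k times with random undersampling, rank features on each balanced sample, then aggregate the k rankings) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the correlation-based ranker stands in for any of the six rankers studied, and all function names and parameters (`k`, `n_select`) are assumptions for the sketch.

```python
import numpy as np

def random_undersample(X, y, rng):
    # RUS: balance the classes by randomly discarding majority-class instances.
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def rank_features(X, y):
    # Simple filter ranker: absolute correlation of each feature with the
    # class label (a stand-in for any of the six rankers in the study).
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(np.argsort(-scores))  # rank 0 = best feature

def iterative_feature_selection(X, y, k=10, n_select=3, seed=0):
    # Sample the dataset k times with RUS, rank the features on each sample,
    # then aggregate the k rankings by mean rank into a single ranking.
    rng = np.random.default_rng(seed)
    ranks = np.zeros((k, X.shape[1]))
    for i in range(k):
        Xs, ys = random_undersample(X, y, rng)
        ranks[i] = rank_features(Xs, ys)
    mean_rank = ranks.mean(axis=0)
    return np.argsort(mean_rank)[:n_select]  # indices of the selected features
```

Aggregating over k undersampled views makes the final ranking less sensitive to which majority-class instances happen to be discarded in any single RUS pass.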
