Comparing Feature Selection Techniques for Software Quality Estimation Using Data-Sampling-Based Boosting Algorithms

Software defect prediction is a classification technique that uses software metrics and fault data collected during the software development process to identify fault-prone modules before the testing phase. It aims to optimize project resource allocation and ultimately improve the quality of software products. However, two factors, high dimensionality and class imbalance, may lower the quality of the training data and subsequently degrade classification models. Feature (software metric) selection and data sampling are frequently used to overcome these problems. Feature selection (FS) is the process of choosing a subset of relevant features so that the quality of prediction models can be maintained or improved. Data sampling alters the dataset to change its balance level, thereby alleviating the bias of traditional classification models toward the overrepresented (majority) class. A recent study shows that another method, called boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), is also effective for addressing the class imbalance problem. In this paper, we present a technique that uses FS followed by a boosting algorithm in the context of software quality estimation. We investigate four FS approaches (individual FS, repetitive sampled FS, sampled ensemble FS, and repetitive sampled ensemble FS) and study their impact on the quality of the prediction models. Ten base feature ranking techniques are examined in the case study. We also employ the boosting algorithm to construct classification models with no FS and use the results as the baseline for further comparison.
The empirical results demonstrate that (1) FS is important and necessary prior to the learning process; (2) the repetitive sampled FS method generally has similar performance to the individual FS technique; and (3) the ensemble filter (including sampled ensemble filter and repetitive sampled ensemble filter) performs better than or similarly to the average of the corresponding individual base rankers.
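
The idea behind a repetitive sampled ensemble filter can be illustrated with a minimal sketch. It is not the implementation used in the paper: the abstract does not name the ten base rankers, the sampling method, or the aggregation rule, so the two rankers (`mean_diff_score`, `auc_score`), random undersampling, mean-rank aggregation, and the synthetic data below are all assumptions chosen for illustration.

```python
import random
from statistics import mean


def undersample(X, y, rng):
    """Randomly drop majority-class rows until both classes are the same size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [X[i] for i in keep], [y[i] for i in keep]


def mean_diff_score(column, y):
    """Hypothetical ranker 1: absolute difference between the class means."""
    a = [v for v, label in zip(column, y) if label == 1]
    b = [v for v, label in zip(column, y) if label == 0]
    return abs(mean(a) - mean(b))


def auc_score(column, y):
    """Hypothetical ranker 2: AUC of the raw feature, folded so 0.5 is worst."""
    pos = [v for v, label in zip(column, y) if label == 1]
    neg = [v for v, label in zip(column, y) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1 - auc)


def to_ranks(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def repetitive_sampled_ensemble_ranking(X, y, rankers, repetitions=10, seed=0):
    """Rank features on repeated balanced samples and average all rankings."""
    rng = random.Random(seed)
    n_features = len(X[0])
    all_ranks = []
    for _ in range(repetitions):          # "repetitive": several sampled datasets
        Xs, ys = undersample(X, y, rng)   # "sampled": balance before ranking
        for ranker in rankers:            # "ensemble": several base rankers
            scores = [ranker([row[j] for row in Xs], ys) for j in range(n_features)]
            all_ranks.append(to_ranks(scores))
    avg = [mean(r[j] for r in all_ranks) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: avg[j])  # best feature first


# Synthetic, imbalanced data (assumption): feature 0 is informative, 1 and 2 are noise.
data_rng = random.Random(42)
y = [1] * 20 + [0] * 180
X = [[3.0 * label + data_rng.gauss(0, 1), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
     for label in y]
ranking = repetitive_sampled_ensemble_ranking(X, y, [mean_diff_score, auc_score])
print(ranking)
```

In this sketch, setting `repetitions=1` gives a sampled ensemble filter, and passing a single ranker gives repetitive sampled FS. The top-ranked features would then be passed to a boosting classifier, which is outside the scope of the example.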
