A feature selection approach based on a similarity measure for software defect prediction

Software defect prediction is aimed to find potential defects based on historical data and software features. Software features can reflect the characteristics of software modules. However, some of these features may be more relevant to the class (defective or non-defective), but others may be redundant or irrelevant. To fully measure the correlation between different features and the class, we present a feature selection approach based on a similarity measure (SM) for software defect prediction. First, the feature weights are updated according to the similarity of samples in different classes. Second, a feature ranking list is generated by sorting the feature weights in descending order, and all feature subsets are selected from the feature ranking list in sequence. Finally, all feature subsets are evaluated on a k-nearest neighbor (KNN) model and measured by an area under curve (AUC) metric for classification performance. The experiments are conducted on 11 National Aeronautics and Space Administration (NASA) datasets, and the results show that our approach performs better than or is comparable to the compared feature selection approaches in terms of classification performance.

[1]  Shane McIntosh,et al.  Automated Parameter Optimization of Classification Techniques for Defect Prediction Models , 2016, 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).

[2]  Azuraliza Abu Bakar,et al.  Hybrid feature selection based on enhanced genetic algorithm for text categorization , 2016, Expert Syst. Appl..

[3]  Jaechang Nam,et al.  CLAMI: Defect Prediction on Unlabeled Datasets (T) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[4]  Taghi M. Khoshgoftaar,et al.  On the Stability of Feature Selection Methods in Software Quality Prediction: An Empirical Investigation , 2015, Int. J. Softw. Eng. Knowl. Eng..

[5]  Donghai Guan,et al.  Topological Similarity-Based Feature Selection for Graph Classification , 2015, Comput. J..

[6]  Tim Menzies,et al.  Heterogeneous Defect Prediction , 2015, IEEE Transactions on Software Engineering.

[7]  Baowen Xu,et al.  Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning , 2015, ESEC/SIGSOFT FSE.

[8]  Juan-Zi Li,et al.  A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure , 2015, Inf. Sci..

[9]  Mohammad Alshayeb,et al.  Software defect prediction using ensemble learning on selected features , 2015, Inf. Softw. Technol..

[10]  Amri Napolitano,et al.  A comparative study of iterative and non-iterative feature selection techniques for software defect prediction , 2013, Information Systems Frontiers.

[11]  Xiang Chen,et al.  FECAR: A Feature Selection Framework for Software Defect Prediction , 2014, 2014 IEEE 38th Annual Computer Software and Applications Conference.

[12]  Xiao-Yuan Jing,et al.  Software defect prediction based on collaborative representation classification , 2014, ICSE Companion.

[13]  Jin Liu,et al.  Dictionary learning based software defect prediction , 2014, ICSE.

[14]  Will N. Browne,et al.  Particle Swarm Optimisation for Feature Selection in Classification : A Multi-Objective Approach , 2012 .

[15]  Qinbao Song,et al.  Data Quality: Some Comments on the NASA Software Defect Datasets , 2013, IEEE Transactions on Software Engineering.

[16]  Serkan Günal,et al.  A novel probabilistic feature selection method for text classification , 2012, Knowl. Based Syst..

[17]  Daoqiang Zhang,et al.  Cost-sensitive feature selection with application in software defect prediction , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[18]  Chen Lin,et al.  An Unsupervised Feature Selection Approach Based on Mutual Information , 2012 .

[19]  Bruce Christianson,et al.  The misuse of the NASA metrics data program data sets for automated software defect prediction , 2011, EASE.

[20]  Taghi M. Khoshgoftaar,et al.  Choosing software metrics for defect prediction: an investigation on feature selection techniques , 2011, Softw. Pract. Exp..

[21]  Yue Jiang,et al.  Variance Analysis in Software Fault Prediction Models , 2009, 2009 20th International Symposium on Software Reliability Engineering.

[22]  Lei Liu,et al.  Feature selection with dynamic mutual information , 2009, Pattern Recognit..

[23]  Banu Diri,et al.  Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem , 2009, Inf. Sci..

[24]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[25]  Charles X. Ling,et al.  Using AUC and accuracy in evaluating learning algorithms , 2005, IEEE Transactions on Knowledge and Data Engineering.

[26]  T. Wieczorek,et al.  Comparison of feature ranking methods based on information entropy , 2004, 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541).

[27]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[28]  Igor Kononenko,et al.  Estimating Attributes: Analysis and Extensions of RELIEF , 1994, ECML.

[29]  Robert C. Holte,et al.  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets , 1993, Machine Learning.

[30]  Larry A. Rendell,et al.  A Practical Approach to Feature Selection , 1992, ML.

[31]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.

[32]  Anas N. Al-Rabadi,et al.  A comparison of modified reconstructability analysis and Ashenhurst‐Curtis decomposition of Boolean functions , 2004 .

[33]  F. Wilcoxon Individual Comparisons by Ranking Methods , 1945 .

[34]  Jaechang Nam,et al.  CLAMI: Defect Prediction on Unlabeled Datasets , 2015, ASE 2015.

[35]  A. Karegowda,et al.  COMPARATIVE STUDY OF ATTRIBUTE SELECTION USING GAIN RATIO AND CORRELATION BASED FEATURE SELECTION , 2010 .

[36]  Huan Liu,et al.  Feature Selection: An Ever Evolving Frontier in Data Mining , 2010, FSDM.

[37]  杨胜,et al.  Feature selection based on mutual information and redundancy-synergy coefficient , 2004 .

[38]  Mark A. Hall,et al.  Correlation-based Feature Selection for Machine Learning , 2003 .

[39]  Maurice H. Halstead,et al.  Elements of software science , 1977 .