An Empirical Study of Software Metrics Selection Using Support Vector Machine

The objective of feature selection is to identify irrelevant and redundant features so that they can be discarded from the analysis. Reducing the number of metrics (features) in a data set can lead to faster software quality model training and improved classifier performance. In this study we focus on feature ranking using linear Support Vector Machines (SVMs), as implemented in WEKA. The contribution of this study is an extensive empirical evaluation of SVM rankers built from imbalanced data. Should features be removed at each iteration? What is the recommended value for the tolerance parameter? We address these and other related questions in this work.
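The core idea of linear-SVM feature ranking is that, after training, each feature's importance can be read off from the magnitude of its weight in the separating hyperplane. A minimal sketch of this follows, using scikit-learn's `LinearSVC` rather than the WEKA implementation studied in the paper; the data set, parameter values, and class-imbalance ratio are illustrative assumptions, not those of the study:

```python
# Sketch: ranking features by absolute linear-SVM weights.
# Uses scikit-learn (an assumption; the paper uses WEKA's SVM).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Imbalanced synthetic data: 10 features, only 3 informative,
# 90/10 class split to mimic a defect-prediction scenario.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X = StandardScaler().fit_transform(X)  # scale so weights are comparable

# tol plays the role of the tolerance parameter discussed in the text.
svm = LinearSVC(C=1.0, tol=1e-4, max_iter=10000, random_state=0)
svm.fit(X, y)

# Rank features from largest to smallest absolute weight.
ranking = np.argsort(-np.abs(svm.coef_[0]))
print(ranking)  # indices of features, best-ranked first
```

An iterative variant (recursive feature elimination) would retrain the SVM after dropping the lowest-ranked feature at each step, which is the "remove features at each iteration?" question the study examines empirically.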
