High-Dimensional Software Engineering Data and Feature Selection

Software metrics collected during project development play a critical role in software quality assurance, and practitioners need to know which metrics to focus on for software quality prediction. While a concise set of software metrics is often desired, a typical project collects a very large number of them. Little attention has been devoted to finding the minimal set of software metrics that offers the same predictive capability as a larger set; we address that question in this paper. We present a comprehensive comparison between seven commonly used filter-based feature ranking techniques (FRT) and our proposed hybrid feature selection (HFS) technique. Our case study consists of a high-dimensional software measurement data set (42 software attributes) obtained from a large telecommunications system. The empirical analysis indicates that HFS performs better than FRT; however, the Kolmogorov-Smirnov feature ranking technique demonstrates competitive performance. For the telecommunications system, only 10% of the software attributes are needed for effective software quality prediction.
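To make the filter-based ranking idea concrete, the sketch below ranks software metrics by the two-sample Kolmogorov-Smirnov statistic (the KS ranking technique the abstract mentions) and keeps the top fraction of attributes. The toy data, feature layout, and the `keep_fraction` cutoff are illustrative assumptions, not the paper's actual data set or implementation.

```python
# Hedged sketch: filter-based feature ranking via the two-sample
# Kolmogorov-Smirnov (KS) statistic. Each metric is scored by how well
# its distribution separates fault-prone from not-fault-prone modules;
# the top fraction of metrics is retained. Toy data is hypothetical.

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two
    empirical cumulative distribution functions."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    max_gap = 0.0
    for x in points:
        cdf_a = sum(v <= x for v in sample_a) / n_a
        cdf_b = sum(v <= x for v in sample_b) / n_b
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

def rank_features(rows, labels, keep_fraction=0.10):
    """Score each metric column by its KS statistic between the
    fault-prone (label 1) and not-fault-prone (label 0) modules,
    then return the indices of the top keep_fraction of metrics."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        fp = [r[j] for r, y in zip(rows, labels) if y == 1]
        nfp = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((ks_statistic(fp, nfp), j))
    scores.sort(reverse=True)  # highest separation first
    k = max(1, round(keep_fraction * n_features))
    return [j for _, j in scores[:k]]

# Toy example: 3 metrics over 4 modules; metric 0 separates the classes best.
rows = [[10, 5, 1], [12, 4, 2], [2, 5, 1], [1, 6, 2]]
labels = [1, 1, 0, 0]
print(rank_features(rows, labels, keep_fraction=0.34))  # prints [0]
```

As a filter, this scoring is independent of any classifier; a hybrid approach like the paper's HFS would additionally evaluate candidate subsets with a learner.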
