High-Dimensional Software Engineering Data and Feature Selection

Software metrics collected during project development play a critical role in software quality assurance, and practitioners need to know which metrics to focus on for software quality prediction. While a concise set of software metrics is often desired, a typical project collects a very large number of them. Little attention has been devoted to finding the minimal set of software metrics that offers the same predictive capability as a larger set; we address that question in this paper. We present a comprehensive comparison between seven commonly used filter-based feature ranking techniques (FRT) and our proposed hybrid feature selection (HFS) technique. Our case study consists of a high-dimensional software measurement data set (42 software attributes) obtained from a large telecommunications system. The empirical analysis indicates that HFS performs better than FRT; however, the Kolmogorov-Smirnov feature ranking technique demonstrates competitive performance. For the telecommunications system, only 10% of the software attributes are needed for effective software quality prediction.
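To make the filter-based ranking idea concrete, the sketch below ranks software metrics by the two-sample Kolmogorov-Smirnov statistic (the KS ranking technique the abstract mentions) and keeps the top fraction of attributes. The toy data, feature layout, and the `keep_fraction` cutoff are illustrative assumptions, not the paper's actual data set or implementation.

```python
# Hedged sketch: filter-based feature ranking via the two-sample
# Kolmogorov-Smirnov (KS) statistic. Each metric is scored by how well
# its distribution separates fault-prone from not-fault-prone modules;
# the top fraction of metrics is retained. Toy data is hypothetical.

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two
    empirical cumulative distribution functions."""
    points = sorted(set(sample_a) | set(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    max_gap = 0.0
    for x in points:
        cdf_a = sum(v <= x for v in sample_a) / n_a
        cdf_b = sum(v <= x for v in sample_b) / n_b
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

def rank_features(rows, labels, keep_fraction=0.10):
    """Score each metric column by its KS statistic between the
    fault-prone (label 1) and not-fault-prone (label 0) modules,
    then return the indices of the top keep_fraction of metrics."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        fp = [r[j] for r, y in zip(rows, labels) if y == 1]
        nfp = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((ks_statistic(fp, nfp), j))
    scores.sort(reverse=True)  # highest separation first
    k = max(1, round(keep_fraction * n_features))
    return [j for _, j in scores[:k]]

# Toy example: 3 metrics over 4 modules; metric 0 separates the classes best.
rows = [[10, 5, 1], [12, 4, 2], [2, 5, 1], [1, 6, 2]]
labels = [1, 1, 0, 0]
print(rank_features(rows, labels, keep_fraction=0.34))  # prints [0]
```

As a filter, this scoring is independent of any classifier; a hybrid approach like the paper's HFS would additionally evaluate candidate subsets with a learner.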
