Predicting high-risk program modules by selecting the right software measurements

Timely detection of high-risk program modules in high-assurance software is critical for avoiding the severe consequences of operational failures. While software risk can originate from external sources, such as management or outsourcing, software quality is adversely affected when internal software risks are realized, such as improper practice of standard software processes or the lack of a defined software quality infrastructure. Practitioners employ various techniques to identify and rectify high-risk or low-quality program modules. The effectiveness of detecting such modules depends on the software measurements used, which makes feature selection an important step in software quality prediction. We use a wrapper-based feature ranking technique to select the optimal set of software metrics for building defect prediction models. We also address the adverse effects of class imbalance (very few low-quality modules compared to high-quality modules), a practical problem observed in high-assurance systems. Applying a data sampling technique followed by feature selection is a relatively novel contribution of our work. We present a comprehensive investigation of the impact of data sampling followed by attribute selection on defect predictors built with imbalanced data. The case study data are obtained from several real-world high-assurance software projects. The key results are that attribute selection is more efficient when applied after data sampling, and that defect prediction performance generally improves after applying data sampling and feature selection.
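
The sample-then-select ordering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the sampling technique, the learner, or the performance metric, so random undersampling, a one-feature nearest-centroid learner, and AUC are assumptions chosen for brevity. The data and the metric names (`loc`, `noise`) are synthetic and hypothetical.

```python
import random
from statistics import mean


def undersample(rows, labels, rng):
    """Random undersampling: drop majority-class rows until classes are balanced."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def wrapper_score(values, labels, folds=5):
    """Cross-validated AUC of a learner trained on a single feature.

    Assumes every training fold contains both classes.
    """
    scores = [0.0] * len(values)
    for k in range(folds):
        train = [i for i in range(len(values)) if i % folds != k]
        mu_pos = mean(values[i] for i in train if labels[i] == 1)
        mu_neg = mean(values[i] for i in train if labels[i] == 0)
        for i in range(k, len(values), folds):
            # Higher score means closer to the defective-class centroid.
            scores[i] = abs(values[i] - mu_neg) - abs(values[i] - mu_pos)
    return auc(scores, labels)


def rank_features(rows, labels, names, seed=0):
    """Balance the data first, then rank each feature by its wrapper score."""
    rows, labels = undersample(rows, labels, random.Random(seed))
    scored = [(wrapper_score([r[j] for r in rows], labels), name)
              for j, name in enumerate(names)]
    return sorted(scored, reverse=True)


# Synthetic, imbalanced data: 10% of modules are labeled defective (1).
rng = random.Random(42)
labels = [1 if i % 10 == 0 else 0 for i in range(200)]
rows = [[rng.gauss(100 if y else 50, 10), rng.random()] for y in labels]

ranking = rank_features(rows, labels, ["loc", "noise"])
for score, name in ranking:
    print(f"{name}: AUC = {score:.3f}")
```

In this sketch the ranking is computed on the balanced sample rather than on the original data, which mirrors the ordering the abstract reports as preferable. A real study would substitute the project's own metrics, learner, and sampling method.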
