Prediction of Fault-Prone Software Modules using Statistical and Machine Learning Methods

The demand for quality software has increased rapidly over the last few years. This has spurred the development of machine learning methods for exploring data sets, which can be used to construct models for predicting quality attributes such as fault proneness, maintenance effort, testing effort, productivity, and reliability. This paper examines and compares logistic regression and six machine learning methods (artificial neural network, decision tree, support vector machine, cascade correlation network, group method of data handling polynomial method, and gene expression programming). These methods are explored empirically to determine the effect of static code metrics on the fault proneness of software modules. We use the publicly available data set AR1 to analyze and compare the regression and machine learning methods. The performance of the methods is compared by computing the area under the curve (AUC) from Receiver Operating Characteristic (ROC) analysis. The results show that the model built with decision tree modeling achieves an AUC of 0.865 and outperforms the models built with logistic regression and the other machine learning methods. The study shows that machine learning methods are useful in constructing software quality models.
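As an illustration of the comparison criterion, the area under the ROC curve can be read as the probability that a randomly chosen faulty module receives a higher predicted score than a randomly chosen non-faulty one, with ties counted as one half. The following is a minimal sketch of that calculation under stated assumptions, not the procedure or tooling used in the study: the function name `roc_auc`, the label encoding (1 = faulty, 0 = not faulty), and the toy labels and scores are all hypothetical and are not taken from the AR1 data set.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for a faulty module, 0 for a non-faulty one (assumed encoding).
    scores: predicted fault-proneness, higher meaning more likely faulty.
    """
    faulty = [s for y, s in zip(labels, scores) if y == 1]
    clean = [s for y, s in zip(labels, scores) if y == 0]
    if not faulty or not clean:
        raise ValueError("need at least one faulty and one non-faulty module")

    # Count the (faulty, non-faulty) pairs that are ranked correctly;
    # a tied pair counts as half a correct ranking.
    wins = 0.0
    for f in faulty:
        for c in clean:
            if f > c:
                wins += 1.0
            elif f == c:
                wins += 0.5
    return wins / (len(faulty) * len(clean))


# Hypothetical toy data: four modules scored by two illustrative models.
labels = [1, 0, 1, 0]
model_a_scores = [0.9, 0.8, 0.4, 0.1]
model_b_scores = [0.9, 0.2, 0.7, 0.1]

auc_a = roc_auc(labels, model_a_scores)
auc_b = roc_auc(labels, model_b_scores)
print(auc_a, auc_b)
```

Under this criterion the model with the larger value ranks faulty modules above non-faulty ones more consistently. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.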
