A Comprehensive Empirical Study of Count Models for Software Fault Prediction

Count models, such as the Poisson regression model and the negative binomial regression model, can be used to obtain software fault predictions. With the aid of such predictions, the development team can improve the quality of operational software. The zero-inflated and hurdle count models may be more appropriate when, for a given software system, very few modules have faults. The related literature lacks quantitative guidance on applying count models to software quality prediction. This study presents a comprehensive empirical investigation of eight count models in the context of software fault prediction. It includes comparative hypothesis testing, model selection, and performance evaluation of the count models with respect to different criteria. The case study is a full-scale industrial software system. The information obtained from hypothesis testing and model selection techniques was not consistent with the predictive performance of the count models. Moreover, the comparative analysis based on one criterion did not match that based on another criterion. However, with respect to a given criterion, the performance of a count model was consistent across the fit and test data sets. This indicates that, if a fitted model is considered good based on a given criterion, then the model will yield a good prediction based on the same criterion. The relative performances of the eight models are evaluated using a one-way ANOVA model and Tukey's multiple comparison technique. The comparative study is useful in selecting the best count model for estimating the quality of a given software system.
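To make the distinction between the plain, zero-inflated, and hurdle count models concrete, the following is a minimal sketch of their probability mass functions for the Poisson case. It is an illustration under stated assumptions, not the study's implementation: the function names and the parameters `lam`, `pi`, and `p_zero` are hypothetical, and they are held constant here, whereas in a regression setting they would depend on each module's software metrics through link functions.

```python
import math


def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson count with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson (illustrative form).

    With probability pi the module is a structural zero (never faulty);
    otherwise its fault count follows Poisson(lam).
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, lam, p_zero):
    """Hurdle Poisson (illustrative form).

    A zero occurs with probability p_zero; once the hurdle is crossed,
    positive counts follow a zero-truncated Poisson(lam).
    """
    if k == 0:
        return p_zero
    truncation = 1.0 - math.exp(-lam)  # P(Y > 0) under Poisson(lam)
    return (1.0 - p_zero) * poisson_pmf(k, lam) / truncation


if __name__ == "__main__":
    # Assumed values chosen only for illustration.
    lam, pi, p_zero = 2.0, 0.3, 0.7
    for k in range(4):
        print(k, poisson_pmf(k, lam), zip_pmf(k, lam, pi), hurdle_pmf(k, lam, p_zero))
```

Both modified models place more probability on zero than the plain Poisson model does, which is why they may suit systems in which most modules have no faults. The zero-inflated form mixes two sources of zeros, while the hurdle form models zeros and positive counts with separate processes.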
