Software Defect Prediction: Heuristics for Weighted Naïve Bayes

Defect prediction is an important topic in software quality research. Statistical models for defect prediction can be built from project repositories, which store software metrics and defect information that are then matched with software modules. Naïve Bayes is a well-known, simple statistical technique that assumes the features are independent and equally important, assumptions that do not hold in many problems. Nevertheless, Naïve Bayes performs well on a wide spectrum of prediction problems. This paper addresses the assumption that features are equally important. We propose using heuristics to assign weights to features according to their importance, with the aim of improving defect prediction performance. We compare the performance of the weighted Naïve Bayes and standard Naïve Bayes predictors on publicly available datasets. Our experimental results indicate that assigning weights to software metrics increases prediction performance significantly.
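
To make the idea of feature weighting concrete, the following is a minimal sketch of one common formulation of weighted Naïve Bayes, in which each feature's log-likelihood is multiplied by a weight before the terms are summed. The abstract does not state the likelihood model, the weighting scheme, or the heuristics the paper uses, so everything here is an assumption for illustration: the Gaussian likelihood, the multiplicative weights, the function names (`fit_gaussian_nb`, `log_gaussian`, `predict`), and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian (assumed likelihood model)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, row, weights):
    """Pick the class maximising log prior + sum_i w_i * log p(x_i | class).

    With all weights equal to 1 this reduces to standard Gaussian
    Naive Bayes; a weight of 0 removes a feature from the decision.
    """
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v, w in zip(row, means, variances, weights):
            score += w * log_gaussian(x, m, v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: two metrics per module, label 1 = defective.
X = [[10, 5], [12, 1], [11, 9], [50, 2], [55, 8], [52, 4]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)

standard = predict(model, [13, 5], weights=[1.0, 1.0])  # standard Naive Bayes
weighted = predict(model, [49, 5], weights=[1.0, 0.0])  # second metric ignored
print(standard, weighted)
```

In this sketch the weights are supplied by the caller. A heuristic of the kind the abstract describes would compute them from the data, for example from a per-feature relevance score, but the specific heuristics are not given here and are left open.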