A Two-Stage Feature Weighting Method for Naive Bayes and Its Application in Software Defect Prediction

Software defect prediction (SDP) models facilitate software practitioners to find out defect-prone software modules in software. Software practitioners can then test these defect-prone software modules with limited testing resources to minimize software defects. Among various SDP models, Naive Bayes (NB) has been widely used in SDP because of its simplicity, effectiveness and robustness. The NB classifier is an effective classification approach, especially for data sets with discrete attributes. In NB, the attributes are assumed to be independent and thus equally important. However, in common practice, the attributes of software defect data sets are usually continuous or numeric, and because they are designed for different purposes, their contributions to prediction are different. Therefore, this paper proposes a new NB method called TSWNB, which contains two stages: feature (i.e. attribute) discretization and feature weighting. More specifically, for the stage of feature discretization, we make the comparison between two discretization methods i.e. equal-width discretization method and equal-frequency discretization method, and identify the most appropriate one. For the stage of feature weighting, we use the feature weighting technique to alleviate the equal importance assumption, which combines the obtained feature weights into the NB formula and its likelihood estimations. To evaluate the proposed method, we carry out experiments on 5 software defect data sets of NASA MDP provided by PROMISE repository. Three well-known classification algorithms and two feature weighting techniques are included for comparison. The experimental results reveal the effectiveness and practicability of the two-stage feature weighting method TSWNB.

[1]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[2]  Edward B. Allen,et al.  Case-Based Software Quality Prediction , 2000, Int. J. Softw. Eng. Knowl. Eng..

[3]  Shasha Wang,et al.  Deep feature weighting for naive Bayes and its application to text classification , 2016, Eng. Appl. Artif. Intell..

[4]  Ayse Basar Bener,et al.  Analysis of Naive Bayes' assumptions on software fault data: An empirical study , 2009, Data Knowl. Eng..

[5]  Ruchika Malhotra,et al.  An empirical framework for defect prediction using machine learning techniques with Android software , 2016, Appl. Soft Comput..

[6]  Ayça Tarhan,et al.  Early software defect prediction: A systematic map and review , 2018, J. Syst. Softw..

[7]  Jason Catlett,et al.  On Changing Continuous Attributes into Ordered Discrete Attributes , 1991, EWSL.

[8]  Harry Zhang,et al.  Learning weighted naive Bayes with accurate ranking , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[9]  Ivan Bratko,et al.  ASSISTANT 86: A Knowledge-Elicitation Tool for Sophisticated Users , 1987, EWSL.

[10]  Ron Kohavi,et al.  Error-Based and Entropy-Based Discretization of Continuous Features , 1996, KDD.

[11]  Tzu-Tsung Wong,et al.  A hybrid discretization method for naïve Bayesian classifiers , 2012, Pattern Recognit..

[12]  Sandeep Kumar,et al.  Linear and non-linear heterogeneous ensemble methods to predict the number of faults in software systems , 2017, Knowl. Based Syst..

[13]  Khaled El Emam,et al.  Evaluating Capture-Recapture Models with Two Inspectors , 2001, IEEE Trans. Software Eng..

[14]  D. Hand,et al.  Idiot's Bayes—Not So Stupid After All? , 2001 .

[15]  Mark A. Hall,et al.  Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning , 1999, ICML.

[16]  Nader B. Ebrahimi,et al.  On the Statistical Analysis of the Number of Errors Remaining in a Software Design Document after Inspection , 1997, IEEE Trans. Software Eng..

[17]  Ayse Basar Bener,et al.  Software Defect Prediction: Heuristics for Weighted Naïve Bayes , 2007, ICSOFT.

[18]  Letha H. Etzkorn,et al.  Empirical Validation of Three Software Metrics Suites to Predict Fault-Proneness of Object-Oriented Classes Developed Using Highly Iterative or Agile Software Development Processes , 2007, IEEE Transactions on Software Engineering.

[19]  John Yearwood,et al.  A Framework for Software Defect Prediction and Metric Selection , 2018, IEEE Access.

[20]  Rebecca N. Wright,et al.  A Practical Differentially Private Random Decision Tree Classifier , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[21]  Ron Kohavi,et al.  Supervised and Unsupervised Discretization of Continuous Features , 1995, ICML.

[22]  Tim Menzies,et al.  Data Mining Static Code Attributes to Learn Defect Predictors , 2007, IEEE Transactions on Software Engineering.

[23]  Ethem Alpaydin,et al.  Introduction to machine learning , 2004, Adaptive computation and machine learning.