A novel modified undersampling (MUS) technique for software defect prediction

Background and aim: Many sophisticated data mining and machine learning algorithms have been used for software defect prediction (SDP) to enhance software quality. However, real-world SDP data sets suffer from class imbalance, which biases the classifier and degrades the performance of existing classification algorithms, resulting in inaccurate predictions. This work aims to reduce the class imbalance of the data sets in order to increase the accuracy of defect prediction and decrease processing time. Methodology: The proposed model consists of a modified undersampling (MUS) method, which balances the classes of the data sets, and a correlation feature selection (CFS) method. Results: On ten open-source project data sets, the proposed model achieved F1-scores ranging from 0.52 to 0.96. The best value, 0.96, is close to the maximum of 1, indicating near-perfect prediction performance. Conclusion: The proposed model balances the classes of the data sets with the aim of increasing prediction accuracy and decreasing processing time.
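
For illustration only, the two generic ideas the abstract relies on, class balancing by undersampling and evaluation by F1-score, can be sketched as follows. This is a minimal sketch of plain random undersampling, not the modified undersampling (MUS) method proposed here, whose details the abstract does not give. The function names, the toy data, and the label convention (1 = defective, 0 = clean) are assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Plain random undersampling (NOT the paper's MUS method).

    Randomly keeps, for every class, as many rows as the rarest class has.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Size of the smallest class; every class is cut down to this size.
    target = min(len(members) for members in by_class.values())

    balanced = []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return [row for row, _ in balanced], [label for _, label in balanced]


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 8 clean modules (0) and 2 defective modules (1).
toy_rows = [[i] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_undersample(toy_rows, toy_labels)
print(Counter(balanced_labels))

# One true positive, one false positive, one false negative.
print(f1_score([1, 1, 0, 0, 0], [1, 0, 1, 0, 0]))
```

Because F1 ignores true negatives, it is less flattering than plain accuracy on imbalanced data, which is one reason it is a common choice for reporting defect prediction results.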
