An empirical study toward dealing with noise and class imbalance issues in software defect prediction

The quality of defect datasets is a critical issue in software defect prediction (SDP). These datasets are obtained by mining software repositories, and recent studies have raised concerns about their quality. The concerns stem from inconsistencies between the bug-fix or clean keywords in fault reports and the corresponding links in the change management logs. The class imbalance (CI) problem is another major challenge for SDP models. A defect prediction method trained on noisy and imbalanced data produces inconsistent and unsatisfactory results, so a combined analysis of noisy instances and the CI problem is needed. To the best of our knowledge, few studies have addressed both aspects together. In this paper, we examine the impact of noise and the CI problem on five baseline SDP models: we manually added noise at various levels (0–80%) and measured its impact on the performance of those models. We also provide guidelines on the range of noise that the baseline models can tolerate, and we suggest the SDP model that has the highest noise tolerance and outperforms the other classical methods. The True Positive Rate (TPR) and False Positive Rate (FPR) values of the baseline models reduce by 20–30% after adding 10–40% noisy instances. Similarly, the ROC (Receiver Operating Characteristic) values of the SDP models reduce to 40–50%. The suggested model tolerates noise levels of 40–60%, more than the other traditional models.
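
The noise-injection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the noise was added, so the sketch assumes uniform random flipping of binary class labels (0 = clean, 1 = defective); the function names `inject_label_noise` and `tpr_fpr` are hypothetical and are not taken from the paper.

```python
import random


def inject_label_noise(labels, noise_level, seed=0):
    """Return a copy of `labels` with a fraction `noise_level` of entries flipped.

    Assumption: noise is uniform random label flipping on binary labels
    (0 = clean, 1 = defective). The paper may use a different procedure.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(noise_level * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def tpr_fpr(y_true, y_pred):
    """Compute (TPR, FPR) for binary labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Hypothetical imbalanced dataset: 10 defective modules out of 100.
    clean_labels = [1] * 10 + [0] * 90
    for level in (0.0, 0.1, 0.2, 0.4, 0.6, 0.8):
        noisy = inject_label_noise(clean_labels, level, seed=42)
        flipped = sum(a != b for a, b in zip(clean_labels, noisy))
        print(f"noise={level:.0%} flipped={flipped} defective={sum(noisy)}")
```

In an experiment of this kind, the noisy labels would be used to train each baseline model, and `tpr_fpr` would be evaluated against the untouched labels of a held-out test set at each noise level.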
