An Ensemble Oversampling Model for Class Imbalance Problem in Software Defect Prediction

Software systems are now ubiquitous: they automate everyday tasks in personal and enterprise applications and are essential to many safety-critical and mission-critical systems, e.g., air traffic control systems, autonomous cars, and SCADA systems. With the availability of massive storage capabilities, high-speed Internet, and the advent of Internet of Things devices, modern software systems are growing in both size and complexity. Maintaining the quality of such complex systems and keeping the error rate to a minimum through manual effort alone is a challenge. Therefore, automated detection of faulty components is important both during software development and after delivery. Fault detection models usually need to be trained on a labeled, balanced dataset containing both faulty and non-faulty samples. Earlier work, e.g., Mohsin et al. (2016), showed that most real fault detection training datasets are imbalanced. As a result, the trained model overfits to the majority (non-faulty) class and classifies faulty components as non-faulty. The consequences of a high false negative rate are cumulative: the model generates more errors when it is applied to other, previously unseen software systems, which is very expensive. In this paper, we propose an ensemble model for software defect prediction that addresses the class imbalance problem in real software datasets. We use different oversampling techniques to build an ensemble classifier that reduces the effect of the scarcity of minority (defective) samples in the data. The proposed approach is verified using the PROMISE software engineering datasets. The results show that our ensemble oversampling technique reduces the false negative rate more than standard classification techniques and identifies faulty components more accurately, resulting in a less expensive detection system (lowering the rate at which faulty modules are predicted as non-faulty).
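The general idea of an oversampling ensemble can be illustrated with a minimal sketch. The abstract does not name the oversamplers or the base classifier, so everything here is an assumption made for illustration: the two samplers (random duplication and a simplified SMOTE-like interpolation), the k-nearest-neighbour base learner, the majority vote, the 0/1 label encoding, the toy data, and all function names. It is not the authors' implementation, and it expects at least two minority samples.

```python
import random
from collections import Counter

DEFECTIVE = 1  # assumed label encoding: 1 = faulty (minority), 0 = non-faulty

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def random_oversample(X, y, rng):
    """Balance the classes by duplicating randomly chosen minority rows."""
    rows = [x for x, label in zip(X, y) if label == DEFECTIVE]
    deficit = max(len(y) - 2 * len(rows), 0)  # majority count minus minority count
    extra = [list(rng.choice(rows)) for _ in range(deficit)]
    return X + extra, y + [DEFECTIVE] * deficit

def smote_like_oversample(X, y, rng, k=3):
    """Balance the classes with synthetic points interpolated between a
    minority row and one of its k nearest minority neighbours (simplified)."""
    rows = [x for x, label in zip(X, y) if label == DEFECTIVE]
    deficit = max(len(y) - 2 * len(rows), 0)
    extra = []
    for _ in range(deficit):
        a = rng.choice(rows)
        neighbours = sorted((r for r in rows if r is not a),
                            key=lambda r: sq_dist(a, r))[:k]
        b = rng.choice(neighbours)
        t = rng.random()  # position along the segment from a to b
        extra.append([p + t * (q - p) for p, q in zip(a, b)])
    return X + extra, y + [DEFECTIVE] * deficit

def knn_predict(X_train, y_train, x, k=5):
    """Label x by majority vote among its k nearest training rows."""
    nearest = sorted(zip(X_train, y_train),
                     key=lambda pair: sq_dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def fit_ensemble(X, y, seed=0):
    """Build one resampled training set per oversampler (three members)."""
    rng = random.Random(seed)
    samplers = [
        random_oversample,
        lambda X, y, rng: smote_like_oversample(X, y, rng, k=3),
        lambda X, y, rng: smote_like_oversample(X, y, rng, k=5),
    ]
    return [sampler(X, y, rng) for sampler in samplers]

def ensemble_predict(members, x):
    """Majority vote over the k-NN predictions of all ensemble members."""
    votes = [knn_predict(Xm, ym, x) for Xm, ym in members]
    return Counter(votes).most_common(1)[0][0]

def false_negative_rate(y_true, y_pred):
    """Fraction of truly defective modules predicted as non-defective."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == DEFECTIVE and p != DEFECTIVE)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == DEFECTIVE and p == DEFECTIVE)
    return fn / (fn + tp) if fn + tp else 0.0

# Toy imbalanced data (hypothetical): 40 non-faulty modules near (0, 0),
# 6 faulty modules near (2, 2).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(6)]
y = [0] * 40 + [1] * 6

members = fit_ensemble(X, y)
predictions = [ensemble_predict(members, x) for x in X]
print("ensemble FNR on the toy training data:", false_negative_rate(y, predictions))
```

The false negative rate computed here is FN / (FN + TP), i.e. the share of faulty modules that are predicted as non-faulty. A realistic evaluation would measure it on held-out data, for example one of the PROMISE datasets, rather than on the training samples used in this toy run.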
