Improved software defect prediction using Pruned Histogram-based isolation forest

Abstract: Software defect prediction (SDP) is an active topic in software engineering research. It is used to evaluate software quality and reliability and to allocate limited testing resources effectively. Many data mining and machine learning methods have been applied to SDP, using critical metrics extracted from the software source code and the development process. However, these learning methods have difficulty handling the imbalanced data distribution of the accumulated training dataset. Isolation forest, an anomaly detection method based on ensemble learning, has been studied as a way to deal with this imbalance and obtain high prediction performance. However, isolation forest suffers from slow convergence, which is caused by selecting the feature value at random while building isolation trees. To overcome this problem, this paper constructs a histogram over the value set of the selected isolation feature, which helps identify feature values that are preferable for building isolation trees. Motivated by the “many could be better than all” principle in ensemble learning, an ensemble pruning strategy is further employed to optimize the resulting isolation forest, leading to a novel SDP method named PHIForest (Pruned Histogram-based Isolation Forest). The proposed method converges quickly through histogram-based selection of splitting feature values, while ensemble pruning reduces the ensemble size and improves prediction performance. Comprehensive experiments on ten real datasets demonstrate the effectiveness of the proposed SDP method.
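The idea of replacing a random split value with a histogram-guided one can be illustrated with a minimal sketch. The abstract does not state the paper's actual selection rule, so the heuristic below (split at the centre of the least populated equal-width bin) is an assumption made purely for illustration, and the names `histogram`, `pick_split_value`, and `path_length` are hypothetical.

```python
def histogram(values, n_bins):
    """Return (edges, counts) of an equal-width histogram over values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def pick_split_value(values, n_bins=10):
    """Assumed heuristic (not taken from the paper): split at the centre
    of the least populated bin, i.e. in a low-density gap."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # nothing to split on
    edges, counts = histogram(values, n_bins)
    sparsest = min(range(n_bins), key=counts.__getitem__)
    return (edges[sparsest] + edges[sparsest + 1]) / 2


def path_length(point, values, n_bins=10, max_depth=50):
    """Count the splits needed to isolate `point` on a single feature."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        if min(current) == max(current):
            break  # identical values cannot be separated further
        split = pick_split_value(current, n_bins)
        if point < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


# Toy feature column: five similar values and one outlier.
values = [1.0, 1.1, 1.2, 1.3, 0.9, 10.0]
outlier_depth = path_length(10.0, values)
inlier_depth = path_length(1.0, values)
```

In this toy column the outlier is separated by the first split, whereas a uniformly random split value could land anywhere in the range. A shorter path for the outlier than for an inlier is the signal an isolation tree relies on.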
