Software defect prediction using tree-based ensembles

Software defect prediction is an active research area in software engineering. Accurate prediction of software defects helps software engineers direct software quality assurance activities. In machine learning, ensemble learning has been shown to improve prediction performance over individual machine learning models. Many tree-based ensembles have recently been proposed in the literature, but their prediction capabilities have not been investigated in defect prediction. In this paper, we empirically investigate the prediction performance of seven tree-based ensembles in defect prediction. Two are classified as bagging ensembles (Random Forest and Extra Trees), while the other five are boosting ensembles (AdaBoost, Gradient Boosting, Hist Gradient Boosting, XGBoost and CatBoost). The study used 11 publicly available NASA MDP software defect datasets. Empirical results indicate that the tree-based bagging ensembles, Random Forest and Extra Trees, outperform the tree-based boosting ensembles. However, none of the investigated tree-based ensembles performed significantly worse than individual decision trees. Finally, AdaBoost was the worst-performing of all the tree-based ensembles.
