Performance analysis of ensemble learning for predicting defects in open source software

Machine learning techniques have been widely explored by software engineering researchers. At present, there is no conclusive evidence on which machine learning techniques are most accurate and efficient for software defect prediction, but some recent studies suggest that combining multiple machine learners, that is, ensemble learning, may be a more accurate alternative. This study contributes to the software defect prediction literature by systematically evaluating the predictive accuracy of three well-known homogeneous ensemble methods (Bagging, Boosting, and Rotation Forest) with fifteen important underlying base learners, using data from nine open source object-oriented systems obtained from the PROMISE repository. The results indicate that, while Bagging and Boosting may result in a loss of AUC performance, the Rotation Forest ensemble improves AUC performance for twelve of the fifteen investigated base learners.
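
To make the comparison described above concrete, the following is a minimal sketch of how a single base learner and a homogeneous Bagging ensemble of that same learner can be scored by AUC. It is not the study's experimental setup: the data are synthetic rather than taken from the PROMISE repository, the base learner is a simple decision stump chosen for brevity, and all names (`auc`, `train_stump`, `train_bagging`, `make_data`) are hypothetical. Boosting and Rotation Forest are omitted.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def train_stump(X, y):
    """Fit a one-split decision stump; returns a function that scores a row."""
    n = len(y)
    best = None  # (weighted Gini impurity, feature, threshold, left rate, right rate)
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue
            pl, pr = sum(left) / len(left), sum(right) / len(right)
            gini = len(left) * pl * (1 - pl) + len(right) * pr * (1 - pr)
            if best is None or gini < best[0]:
                best = (gini, j, t, pl, pr)
    if best is None:  # no usable split: fall back to the base defect rate
        rate = sum(y) / n
        return lambda row: rate
    _, j, t, pl, pr = best
    return lambda row: pl if row[j] <= t else pr


def train_bagging(X, y, n_estimators=25, seed=0):
    """Bagging: train each stump on a bootstrap sample, then average the scores."""
    rng = random.Random(seed)
    n = len(y)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: sum(m(row) for m in models) / len(models)


def make_data(n, seed):
    """Synthetic stand-in for module metrics; the labeling rule is an assumption."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1) for _ in range(3)]
        X.append(row)
        y.append(1 if row[0] + row[1] + rng.gauss(0, 1) > 1.0 else 0)
    return X, y


X_train, y_train = make_data(200, seed=1)
X_test, y_test = make_data(100, seed=2)

single = train_stump(X_train, y_train)
bagged = train_bagging(X_train, y_train)

auc_single = auc([single(r) for r in X_test], y_test)
auc_bagged = auc([bagged(r) for r in X_test], y_test)
print(f"single stump AUC: {auc_single:.3f}")
print(f"bagged stumps AUC: {auc_bagged:.3f}")
```

The difference between the two printed AUC values is the kind of per-learner gain or loss the study reports. On a single synthetic split like this one, the sign of that difference is not meaningful evidence either way.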
