Linear and non-linear heterogeneous ensemble methods to predict the number of faults in software systems

Unlike earlier work on ensemble methods, which focused on classifying software modules as faulty or non-faulty, this paper expands the use of ensemble methods to the prediction of the number of faults. It investigates both heterogeneous and homogeneous ensemble methods and presents two linear and two non-linear combination rules for combining the outputs of the base learners in an ensemble. Many classification techniques have previously been investigated and evaluated for software fault prediction; they yield different prediction accuracy on different software systems, and no single technique performs consistently better across domains. Ensemble methods, by contrast, can be very effective, since they exploit the strengths of each participating technique on the given dataset and aim to produce better predictions than the individual techniques. Many existing works classify software modules as faulty or non-faulty using ensemble methods, but they only indicate whether a module is faulty; they do not predict the number of faults in it. The use of ensemble methods for predicting the number of faults has not been explored so far. To fill this gap, this paper presents ensemble methods for predicting the number of faults in software modules. The experimental study is designed and conducted on five open-source software projects with fifteen releases, collected from the PROMISE data repository, and the results are evaluated under two scenarios: intra-release prediction and inter-release prediction. Prediction accuracy is assessed using absolute error, relative error, prediction at level l, and the measure of completeness. The results show that the presented ensemble methods yield improved prediction accuracy over the individual fault prediction techniques under consideration, and the improvement is consistent across all the datasets used. The evidence obtained from the prediction at level l and measure of completeness analyses further confirms the effectiveness of the proposed ensemble methods for predicting the number of faults.
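
The abstract does not spell out the exact combination rules or evaluation formulas, so the following is a minimal, hypothetical sketch of how a heterogeneous fault-count ensemble with one linear rule (simple averaging) and one non-linear rule (a small tree meta-learner over base predictions) might be wired together, along with illustrative versions of the reported error measures (average absolute error, average relative error, pred(l)). The dataset, base learners, specific rules, and helper names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch (not the paper's exact method): heterogeneous ensemble for
# fault-count regression with a linear and a non-linear combination rule.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for a PROMISE-style dataset (module metrics -> fault counts).
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=1)
y = np.clip(np.round(y / 50.0 + 3), 0, None)  # coerce targets to non-negative counts
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Heterogeneous base learners (illustrative choice).
bases = [LinearRegression(),
         DecisionTreeRegressor(max_depth=5, random_state=1),
         MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=1)]
preds_tr = np.column_stack([m.fit(X_tr, y_tr).predict(X_tr) for m in bases])
preds_te = np.column_stack([m.predict(X_te) for m in bases])

# Linear combination rule: unweighted average of base-learner outputs.
linear_pred = preds_te.mean(axis=1)

# Non-linear combination rule: a shallow tree learned on the base predictions.
# (In practice, held-out/cross-validated base predictions avoid leakage.)
meta = DecisionTreeRegressor(max_depth=3, random_state=1).fit(preds_tr, y_tr)
nonlinear_pred = meta.predict(preds_te)

def aae(y_true, y_pred):   # average absolute error
    return np.mean(np.abs(y_true - y_pred))

def are(y_true, y_pred):   # average relative error (denominator +1 avoids div-by-zero)
    return np.mean(np.abs(y_true - y_pred) / (y_true + 1.0))

def pred_l(y_true, y_pred, l=0.3):  # fraction of modules with relative error <= l
    return np.mean(np.abs(y_true - y_pred) / (y_true + 1.0) <= l)

for name, p in [("linear", linear_pred), ("non-linear", nonlinear_pred)]:
    print(f"{name}: AAE={aae(y_te, p):.2f}  ARE={are(y_te, p):.2f}  "
          f"pred(0.3)={pred_l(y_te, p):.2f}")
```

For inter-release prediction, the same pipeline would train the base learners and the combiner on one release and evaluate on a later release instead of a random train/test split; for intra-release prediction, a split or cross-validation within a single release is used, as described in the abstract.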
