Towards an ensemble based system for predicting the number of software faults

This paper presents an ensemble-based system for predicting the number of software faults. The system is based on a heterogeneous ensemble method and uses three fault prediction techniques as base learners for the ensemble. The results are verified on Eclipse datasets.

Various researchers have previously applied different techniques to software fault prediction. The performance of these techniques has been observed to vary from dataset to dataset, which makes them inconsistent for fault prediction in an unknown software project. Ensemble methods, on the other hand, can be very effective for software fault prediction, because they take advantage of different techniques on a given dataset to produce better prediction results than an individual technique. Many works are available on binary-class software fault prediction (faulty or non-faulty) using ensemble methods, but the use of ensemble methods for predicting the number of faults has not been explored so far. The objective of this work is to present a system that uses an ensemble of various learning techniques to predict the number of faults in given software modules. We present a heterogeneous ensemble method for predicting the number of faults and use approaches based on a linear combination rule and a non-linear combination rule for the ensemble. The study is designed and conducted on different software fault datasets collected from publicly available data repositories. The results indicate that the presented system predicted the number of faults with higher accuracy than the individual fault prediction techniques, and the results are consistent across all the datasets. We also use prediction at level l (Pred(l)) and measure of completeness to evaluate the results. Pred(l) shows the number of modules in a dataset for which the average relative error value is less than or equal to a threshold value l.
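The Pred(l) measure described above can be illustrated with a minimal Python sketch. The function name `pred_at_level`, the sample values, and the default threshold are hypothetical. The relative error is computed here as |predicted - actual| / (actual + 1); the added 1 is an assumption made so that modules with zero actual faults do not cause a division by zero, and the abstract does not state which formula the study uses.

```python
from typing import Sequence, Tuple


def pred_at_level(
    actual: Sequence[float],
    predicted: Sequence[float],
    level: float = 0.3,
) -> Tuple[int, float]:
    """Count the modules whose relative error is at most `level`.

    Returns (count, fraction of all modules).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one module is required")

    hits = 0
    for y, y_hat in zip(actual, predicted):
        # Assumption: +1 in the denominator guards against zero-fault modules.
        relative_error = abs(y_hat - y) / (y + 1)
        if relative_error <= level:
            hits += 1
    return hits, hits / len(actual)


# Hypothetical fault counts for four modules and their predicted values.
actual_faults = [0, 1, 2, 4]
predicted_faults = [0.2, 1.0, 3.5, 4.0]
count, fraction = pred_at_level(actual_faults, predicted_faults, level=0.3)
```

Returning both the raw count and the fraction lets the same function serve either reading of Pred(l): as a number of modules, as stated above, or as a share of the dataset.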
The prediction at level l analysis and the measure of completeness analysis also confirmed the effectiveness of the presented system for predicting the number of faults. Compared to a single fault prediction technique, the ensemble methods produced improved performance for predicting the number of software faults. The main impact of this work is to allow better utilization of testing resources, helping in the early and quick identification of most of the faults in the software system.
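To make the idea of combining base learners concrete, the following Python sketch shows one linear and one non-linear combination rule applied to the outputs of three base learners. It is an illustration under stated assumptions, not the combination rules used in the study: the function names, the weights, and the sample predictions are hypothetical, and the median is used only as a simple stand-in for a non-linear rule.

```python
from statistics import median
from typing import List, Sequence


def linear_combine(
    base_preds: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Linear rule: weighted average of the base learners' predictions.

    base_preds[i][j] is the fault count predicted by learner i for module j.
    """
    if len(base_preds) != len(weights):
        raise ValueError("one weight per base learner is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]
    # zip(*base_preds) groups the predictions module by module.
    return [
        sum(w * p for w, p in zip(normalized, module_preds))
        for module_preds in zip(*base_preds)
    ]


def median_combine(base_preds: Sequence[Sequence[float]]) -> List[float]:
    """Non-linear rule (assumed stand-in): per-module median of predictions."""
    return [median(module_preds) for module_preds in zip(*base_preds)]


# Hypothetical outputs of three base learners for three modules.
learner_outputs = [
    [0.0, 2.0, 5.0],
    [1.0, 2.0, 3.0],
    [2.0, 5.0, 4.0],
]
linear_result = linear_combine(learner_outputs, weights=[1.0, 1.0, 2.0])
nonlinear_result = median_combine(learner_outputs)
```

In practice the weights of a linear rule would typically be derived from each learner's error on held-out data, and a non-linear rule could instead be a trained model that takes the base predictions as inputs; both choices are left open here.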

[77]  Elaine J. Weyuker,et al.  Looking for bugs in all the right places , 2006, ISSTA '06.