Resampling Methods in Software Quality Classification

Background: With the many algorithms available for classification and prediction in software engineering, a systematic way of assessing their performance is needed. Performance assessment typically relies on some form of partitioning or resampling of the original data to reduce biased estimation. For predictive and classification studies in software engineering, however, there is no definitive advice on the most appropriate resampling method to use. This is seen as one of the factors preventing general conclusions about which modeling technique or set of predictor variables is most appropriate. Furthermore, the use of a variety of resampling methods makes any formal meta-analysis of primary study results impossible. It is therefore desirable to examine the influence of different resampling methods and to quantify possible differences.

Objective and method: This study empirically compares five common resampling methods (hold-out validation, repeated random sub-sampling, 10-fold cross-validation, leave-one-out cross-validation and non-parametric bootstrapping) on 8 publicly available data sets, using genetic programming (GP) and multiple linear regression (MLR) as software quality classification approaches. The location of (PF, PD) pairs in ROC (receiver operating characteristic) space and the area under the ROC curve (AUC) are used as accuracy indicators.

Results: In terms of the location of (PF, PD) pairs in ROC space, bootstrapping results fall in the preferred region for 3 of the 8 data sets with GP and for 4 of the 8 data sets with MLR. Based on the AUC measure, there are no significant differences between the resampling methods for either GP or MLR.

Conclusion: Certain data set properties may be responsible for the insignificant differences between the resampling methods based on AUC, including class imbalance, insignificant predictor variables and high dimensionality. With the current selection of data sets and classification techniques, bootstrapping is the preferred method based on the location of (PF, PD) pairs in ROC space. Hold-out validation is not a good choice for comparatively smaller data sets, where leave-one-out cross-validation (LOOCV) performs better; for comparatively larger data sets, 10-fold cross-validation performs better than LOOCV.
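The abstract does not include code, but the five resampling schemes and the (PF, PD)/AUC indicators can be illustrated compactly. The sketch below, assuming scikit-learn, a synthetic data set, and a logistic-regression classifier purely for demonstration (the study itself used GP and MLR on the 8 named data sets), shows how each scheme partitions the data and how PF, PD and AUC would be computed from the held-out predictions.

```python
# Illustrative sketch only: the five resampling schemes compared in the study
# (hold-out, repeated random sub-sampling, 10-fold CV, LOOCV, non-parametric
# bootstrap) and the (PF, PD)/AUC accuracy indicators.
# Data set, classifier and all parameters here are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split, ShuffleSplit, KFold, LeaveOneOut

# Synthetic, mildly imbalanced data (the paper notes imbalance as a data set property).
X, y = make_classification(n_samples=200, n_features=10, weights=[0.8, 0.2],
                           random_state=0)

def pf_pd_auc(y_true, y_score, threshold=0.5):
    """PF = probability of false alarm (FP rate), PD = probability of detection (TP rate)."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    return pf, pd, roc_auc_score(y_true, y_score)

def fit_score(train_idx, test_idx):
    """Fit on the training indices, return true labels and predicted scores for the test indices."""
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    return y[test_idx], clf.predict_proba(X[test_idx])[:, 1]

# 1. Hold-out validation: a single 2/3 vs 1/3 split.
tr, te = train_test_split(np.arange(len(y)), test_size=1/3, random_state=0)
print("hold-out:", pf_pd_auc(*fit_score(tr, te)))

# 2. Repeated random sub-sampling: several independent random splits, results averaged.
scores = [pf_pd_auc(*fit_score(tr, te))
          for tr, te in ShuffleSplit(n_splits=10, test_size=1/3, random_state=0).split(X)]
print("repeated sub-sampling (mean PF, PD, AUC):", np.mean(scores, axis=0))

# 3. 10-fold cross-validation: pool the held-out predictions across folds, then score once.
y_true, y_score = [], []
for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    t, s = fit_score(tr, te)
    y_true.extend(t); y_score.extend(s)
print("10-fold CV:", pf_pd_auc(np.array(y_true), np.array(y_score)))

# 4. Leave-one-out cross-validation: same pooling idea, one case held out at a time.
y_true, y_score = [], []
for tr, te in LeaveOneOut().split(X):
    t, s = fit_score(tr, te)
    y_true.extend(t); y_score.extend(s)
print("LOOCV:", pf_pd_auc(np.array(y_true), np.array(y_score)))

# 5. Non-parametric bootstrap: train on a bootstrap sample, evaluate on the
#    out-of-bag cases, repeat and average.
rng = np.random.RandomState(0)
scores = []
for _ in range(50):
    boot = rng.randint(0, len(y), size=len(y))
    oob = np.setdiff1d(np.arange(len(y)), boot)
    scores.append(pf_pd_auc(*fit_score(boot, oob)))
print("bootstrap (mean PF, PD, AUC):", np.mean(scores, axis=0))
```

Pooling the held-out predictions before scoring (as in the CV and LOOCV branches above) is one common way to obtain a single AUC when individual test folds are too small to score on their own; averaging per-split estimates, as in the sub-sampling and bootstrap branches, is the usual alternative. Either convention is an assumption here, not necessarily the one used in the study.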
