Measuring the Prediction Error. A Comparison of Cross-Validation, Bootstrap and Covariance Penalty Methods
Computational Statistics and Data Analysis

The estimators most widely used to evaluate the prediction error of a non-linear regression model are examined. An extensive simulation study compares the performance of these estimators across different non-parametric methods and for varying signal-to-noise ratios and sample sizes. Estimators based on resampling methods, such as Leave-one-out, parametric and non-parametric Bootstrap, repeated Cross-Validation and Hold-out, were considered. The non-parametric methods used are Regression Trees, Projection Pursuit Regression and Neural Networks. The repeated-corrected 10-fold Cross-Validation estimator and the Parametric Bootstrap estimator performed best in the simulations.
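As a rough illustration of the resampling idea behind one of the estimators named above, the following sketch computes a repeated 10-fold Cross-Validation estimate of the mean squared prediction error. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names `fit_knn` and `repeated_kfold_error`, the 1-D k-nearest-neighbour learner, the simulated data and all parameter values are hypothetical, and the correction step that distinguishes the "repeated-corrected" estimator is not implemented.

```python
import math
import random


def fit_knn(x_train, y_train, k_neighbors=5):
    """Hypothetical stand-in learner: 1-D k-nearest-neighbour regression."""
    pairs = list(zip(x_train, y_train))

    def predict(x_new):
        # Average the responses of the k training points closest to x_new.
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x_new))[:k_neighbors]
        return sum(p[1] for p in nearest) / len(nearest)

    return predict


def repeated_kfold_error(x, y, fit, k=10, repeats=5, seed=0):
    """Average the K-fold CV mean squared error over several random partitions."""
    rng = random.Random(seed)
    n = len(y)
    estimates = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        # Split the shuffled indices into k disjoint folds of near-equal size.
        folds = [idx[i::k] for i in range(k)]
        total = 0.0
        for test_idx in folds:
            held_out = set(test_idx)
            train = [i for i in idx if i not in held_out]
            predict = fit([x[i] for i in train], [y[i] for i in train])
            # Each observation is predicted by a model that never saw it.
            total += sum((y[i] - predict(x[i])) ** 2 for i in test_idx)
        estimates.append(total / n)
    return sum(estimates) / repeats


# Simulated data (an assumption for illustration): non-linear signal plus noise.
data_rng = random.Random(1)
x = [data_rng.uniform(-2.0, 2.0) for _ in range(200)]
y = [math.sin(2.0 * xi) + data_rng.gauss(0.0, 0.5) for xi in x]

cv_error = repeated_kfold_error(x, y, fit_knn)
print(f"Repeated 10-fold CV estimate of prediction error: {cv_error:.3f}")
```

Repeating the partition and averaging reduces the variability caused by any single random split into folds; setting `repeats=1` gives ordinary 10-fold Cross-Validation, and setting `k` equal to the sample size gives Leave-one-out.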
