Inference for the Generalization Error

In order to compare learning algorithms, experimental results reported in the machine learning literature often rely on statistical tests of significance to support the claim that a new learning algorithm generalizes better. Such tests should take into account the variability due to the choice of training set, and not only that due to the test examples, as is often the case. Ignoring the training-set variability can lead to gross underestimation of the variance of the cross-validation estimator, and to the wrong conclusion that the new algorithm is significantly better when it is not. We perform a theoretical investigation of the variance of a variant of the cross-validation estimator of the generalization error that takes into account the variability due to the randomness of the training set as well as of the test examples. Our analysis shows that any variance estimator based only on the results of the cross-validation experiment must be biased. The analysis also allows us to propose new estimators of this variance. We show, via simulations, that hypothesis tests about the generalization error using these new variance estimators have better properties than tests involving the variance estimators currently in use and listed in Dietterich (1998). In particular, the new tests have correct size and good power: they do not reject the null hypothesis too often when it is true, yet they tend to reject it frequently when it is false.
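For a concrete feel for the kind of correction the paper motivates, the sketch below implements the widely cited "corrected resampled t-test" idea associated with this line of work: the naive variance of the mean performance difference across K random train/test splits is inflated by a factor of (1/K + n_test/n_train) to account for the overlap between training sets. This is a minimal illustrative sketch in Python (using scipy only for the Student-t tail probability); the inflation factor, the function name, and the example numbers are assumptions for illustration and are not taken from the abstract above.

```python
import math
from statistics import mean, stdev

from scipy import stats  # used only for the Student-t tail probability


def corrected_resampled_t_test(diffs, n_train, n_test, alpha=0.05):
    """Paired comparison of two algorithms over K random train/test splits.

    diffs[k] is (error of algorithm A) - (error of algorithm B) on split k.
    The variance correction factor below is an assumption taken from the
    commonly used "corrected resampled t-test", not from this abstract.
    """
    K = len(diffs)
    d_bar = mean(diffs)
    s2 = stdev(diffs) ** 2                              # naive sample variance of the differences
    var_corrected = (1.0 / K + n_test / n_train) * s2   # inflate for overlapping training sets
    t = d_bar / math.sqrt(var_corrected)
    p_value = 2.0 * stats.t.sf(abs(t), df=K - 1)        # two-sided p-value, K-1 degrees of freedom
    return t, p_value, p_value < alpha


# Hypothetical usage: 15 splits, 300 training and 100 test examples per split.
diffs = [0.012, 0.020, -0.004, 0.015, 0.008, 0.011, 0.018, 0.006,
         0.009, 0.013, 0.002, 0.017, 0.010, 0.007, 0.014]
print(corrected_resampled_t_test(diffs, n_train=300, n_test=100))
```

Compared with the uncorrected statistic (which divides s2 by K alone), the corrected variance is larger, so the test rejects less readily; this is exactly the direction of adjustment the abstract argues is needed to avoid declaring spurious improvements significant.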

[1] John Bibby, et al. The Analysis of Contingency Tables, 1978.

[2] H. White. Maximum Likelihood Estimation of Misspecified Models, 1982.

[3] Leo Breiman, et al. Classification and Regression Trees, 1984.

[4] John F. Kolen, et al. Backpropagation is Sensitive to Initial Conditions. Complex Systems, 1990.

[5] R. Tibshirani, et al. An Introduction to the Bootstrap, 1993.

[6] Ron Kohavi, et al. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI, 1995.

[7] D. Wolpert, et al. No Free Lunch Theorems for Search, 1995.

[8] László Györfi, et al. A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability, 1996.

[9] Christopher J. Merz, et al. UCI Repository of Machine Learning Databases, 1996.

[10] Huaiyu Zhu, et al. No Free Lunch for Cross-Validation. Neural Computation, 1996.

[11] L. Breiman. Heuristics of Instability and Stabilization in Model Selection, 1996.

[12] Cyril Goutte. Note on Free Lunches and Cross-Validation. Neural Computation, 1997.

[13] Thomas G. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 1998.

[14] Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, 1998.

[15] Catherine Blake, et al. UCI Repository of Machine Learning Databases, 1998.

[16] Dana Ron, et al. Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation. Neural Computation, 1997.

[17] M. Kearns, et al. Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation, 1999.

[18] Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998.

[19] V. Vapnik. Estimation of Dependences Based on Empirical Data, 2006.