On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation

Model selection strategies for machine learning algorithms typically involve the numerical optimisation of an appropriate model selection criterion, often based on an estimator of generalisation performance, such as k-fold cross-validation. The error of such an estimator can be broken down into bias and variance components. While unbiasedness is often cited as a beneficial quality of a model selection criterion, we demonstrate that a low variance is at least as important, as a non-negligible variance introduces the potential for over-fitting in model selection as well as in training the model. While this observation is in hindsight perhaps rather obvious, the degradation in performance due to over-fitting the model selection criterion can be surprisingly large, an observation that appears to have received little attention in the machine learning literature to date. In this paper, we show that the effects of this form of over-fitting are often of comparable magnitude to differences in performance between learning algorithms, and thus cannot be ignored in empirical evaluation. Furthermore, we show that some common performance evaluation practices are susceptible to a form of selection bias as a result of this form of over-fitting and hence are unreliable. We discuss methods to avoid over-fitting in model selection and subsequent selection bias in performance evaluation, which we hope will be incorporated into best practice. While this study concentrates on cross-validation based model selection, the findings are quite general and apply to any model selection practice involving the optimisation of a model selection criterion evaluated over a finite sample of data, including maximisation of the Bayesian evidence and optimisation of performance bounds.
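To make the selection-bias argument concrete, the following is a minimal sketch (not taken from the paper) of the effect on pure-noise data, where the true accuracy of any classifier is 50%. Reporting the best cross-validation score found during hyper-parameter search gives an optimistically biased estimate, whereas a nested cross-validation, in which the outer folds never see the data used for model selection, does not. The RBF SVM, the small hyper-parameter grid, and the use of scikit-learn are illustrative assumptions, not choices made by the authors.

```python
# Sketch: selection bias from over-fitting the model selection criterion.
# True accuracy on random labels is 0.5 for any classifier.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(100, 20)            # 100 samples, 20 noise features
y = rng.randint(0, 2, 100)        # random labels: no learnable structure

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# Biased protocol: report the best cross-validation score found
# while tuning the hyper-parameters on the same folds.
search.fit(X, y)
print("best CV score (optimistically biased):", search.best_score_)

# Nested cross-validation: hyper-parameters are re-tuned inside each
# outer training fold, so the outer estimate stays close to 0.5.
outer_scores = cross_val_score(search, X, y, cv=5)
print("nested CV estimate (approximately unbiased):", outer_scores.mean())
```

The gap between the two printed figures is an instance of the selection bias discussed above; it grows with the size of the hyper-parameter grid and shrinks with the amount of data, in line with the variance argument.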
