Cross-validation and bootstrapping are unreliable in small sample classification

Interest in statistical classification for critical applications, such as diagnosis from patient samples based on supervised learning, is growing rapidly. To gain acceptance in applications where the subsequent decisions have serious consequences, e.g. the choice of cancer therapy, any such decision support system must come with a reliable performance estimate. Tailored for small sample problems, cross-validation (CV) and bootstrapping (BTS) have been the most commonly used methods for obtaining such estimates in virtually all branches of science for the last 20 years. Here, we address the often overlooked fact that the uncertainty in a point estimate obtained with CV or BTS is unknown and quite large for the small sample classification problems encountered in biomedical applications and elsewhere. To avoid this fundamental problem with CV and BTS, and until improved alternatives have been established, we suggest that the final classification performance should always be reported as a Bayesian confidence interval obtained from a simple holdout test, or by some other method that yields conservative measures of the uncertainty.
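
As a rough illustration of the holdout-based Bayesian interval suggested above, the following Python sketch computes an interval for the true error rate from the number of misclassified holdout samples. It is not taken from the paper: the uniform Beta(1, 1) prior, the equal-tailed form of the interval, and the function names (`beta_posterior_cdf`, `posterior_quantile`, `holdout_interval`) are all assumptions made for illustration. With a uniform prior and `errors` mistakes among `n` holdout samples, the posterior of the error rate is Beta(errors + 1, n - errors + 1), whose CDF for integer parameters can be written as a binomial tail sum.

```python
from math import comb


def beta_posterior_cdf(p, errors, n):
    """CDF at p of Beta(errors + 1, n - errors + 1).

    Assumption: uniform Beta(1, 1) prior on the true error rate.
    For integer parameters the Beta CDF equals a binomial tail sum
    over m = n + 1 trials.
    """
    m = n + 1
    return sum(
        comb(m, j) * p**j * (1.0 - p) ** (m - j)
        for j in range(errors + 1, m + 1)
    )


def posterior_quantile(q, errors, n, tol=1e-10):
    """Invert the posterior CDF by bisection (the CDF is increasing in p)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_posterior_cdf(mid, errors, n) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def holdout_interval(errors, n, level=0.95):
    """Equal-tailed Bayesian interval for the true error rate.

    errors: misclassified holdout samples; n: holdout set size.
    """
    alpha = 1.0 - level
    lower = posterior_quantile(alpha / 2.0, errors, n)
    upper = posterior_quantile(1.0 - alpha / 2.0, errors, n)
    return lower, upper


if __name__ == "__main__":
    # Hypothetical example: 3 errors on a holdout set of 20 samples.
    low, high = holdout_interval(errors=3, n=20)
    print(f"95% interval for the true error rate: [{low:.3f}, {high:.3f}]")
```

For a holdout set this small, the resulting interval is wide relative to the point estimate `errors / n`, which is the kind of uncertainty a single CV or BTS point estimate does not convey. A different prior or a shortest-interval construction would give somewhat different endpoints.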

[1]  Yudong D. He,et al.  Gene expression profiling predicts clinical outcome of breast cancer , 2002, Nature.

[2]  Edward R. Dougherty,et al.  Is cross-validation valid for small-sample microarray classification? , 2004, Bioinform..

[3]  Mark R. Wade,et al.  Construction and Assessment of Classification Rules , 1999, Technometrics.

[4]  Stefan Michiels,et al.  Prediction of cancer outcome with microarrays: a multiple random validation strategy , 2005, The Lancet.

[5]  J. Downing,et al.  Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. , 2002, Cancer cell.

[6]  J. Sudbø,et al.  Gene-expression profiles in hereditary breast cancer. , 2001, The New England journal of medicine.

[7]  John Langford,et al.  A comparison of tight generalization error bounds , 2005, ICML '05.

[8]  David J. Hand,et al.  Ten More Years of Error Rate Research , 2000 .

[9]  E. Jaynes,et al.  Confidence Intervals vs Bayesian Intervals , 1976 .

[10]  S. T. Buckland,et al.  An Introduction to the Bootstrap. , 1994 .

[11]  Robert P.W. Duin,et al.  PRTools3: A Matlab Toolbox for Pattern Recognition , 2000 .

[12]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[13]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[14]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[15]  M. R. Mickey,et al.  Estimation of Error Rates in Discriminant Analysis , 1968 .

[16]  Sayan Mukherjee,et al.  Estimating Dataset Size Requirements for Classifying DNA Microarray Data , 2003, J. Comput. Biol..

[17]  D. J. Hand,et al.  Recent advances in error rate estimation , 1986, Pattern Recognit. Lett..

[18]  L. Staudt,et al.  The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. , 2002, The New England journal of medicine.

[19]  Edward R. Dougherty,et al.  Relation Between Permutation-Test P Values and Classifier Error Estimates , 2004, Machine Learning.

[20]  David Hinkley,et al.  Bootstrap Methods: Another Look at the Jackknife , 2008 .

[21]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[22]  M. Tribus,et al.  Probability theory: the logic of science , 2003 .

[23]  Hanna Göransson,et al.  Improved variance estimation of classification performance via reduction of bias caused by small sample size , 2006, BMC Bioinformatics.

[24]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[25]  M. Radmacher,et al.  Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. , 2003, Journal of the National Cancer Institute.

[26]  M. Stone Cross‐Validatory Choice and Assessment of Statistical Predictions , 1976 .

[27]  B. Efron Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation , 1983 .

[28]  J. Wade Davis,et al.  Statistical Pattern Recognition , 2003, Technometrics.

[29]  Richard M. Simon,et al.  A Paradigm for Class Prediction Using Gene Expression Profiles , 2003, J. Comput. Biol..

[30]  C. Hooker,et al.  Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science , 1976 .

[31]  J. Langford Tutorial on Practical Prediction Theory for Classification , 2005, J. Mach. Learn. Res..

[32]  Geoffrey J. McLachlan,et al.  Analyzing Microarray Gene Expression Data , 2004 .

[33]  L. Brown,et al.  Interval Estimation for a Binomial Proportion , 2001 .

[34]  Paul Pukite,et al.  Foundations of Probability Theory , 1998 .

[35]  E. Parzen On Estimation of a Probability Density Function and Mode , 1962 .

[36]  E. Dougherty,et al.  Confidence Intervals for the True Classification Error Conditioned on the Estimated Error , 2006, Technology in cancer research & treatment.