On the cross‐validation bias due to unsupervised preprocessing

Cross‐validation is the de facto standard for predictive model evaluation and selection. When used properly, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of data‐dependent preprocessing, such as mean‐centring, rescaling, dimensionality reduction and outlier removal. It is often believed that such preprocessing stages, if done in an unsupervised manner (that is, without using the class labels or response values), are generally safe to perform prior to cross‐validation. In this paper, we study three commonly practised preprocessing procedures applied prior to a regression analysis: (i) variance‐based feature selection; (ii) grouping of rare categorical features; and (iii) feature rescaling. We demonstrate that unsupervised preprocessing can, in fact, introduce a substantial bias into cross‐validation estimates and potentially hurt model selection. This bias may be either positive or negative, and its exact magnitude depends on all the parameters of the problem in an intricate manner. Further research is needed to understand the real‐world impact of this bias across different application domains, particularly when dealing with small sample sizes and high‐dimensional data.
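To make the distinction concrete, the following is a minimal sketch, not the paper's experimental setup, of procedure (i): variance‐based feature selection performed once on the full data set before cross‐validation, versus the same selection repeated inside each training fold. The synthetic data, the least‐squares model and the helper names `top_variance_features` and `cv_mse` are all assumptions introduced for illustration.

```python
import numpy as np


def top_variance_features(X, k):
    """Return the indices of the k columns of X with the largest variance."""
    return np.argsort(X.var(axis=0))[-k:]


def cv_mse(X, y, k, n_folds, select_inside):
    """K-fold CV mean squared error of least squares on k selected features.

    select_inside=False: features are chosen once using all rows,
    i.e. unsupervised preprocessing done prior to cross-validation.
    select_inside=True: features are re-chosen from each training fold only,
    so the held-out rows never influence the selection.
    """
    n = X.shape[0]
    folds = np.array_split(np.arange(n), n_folds)
    global_idx = top_variance_features(X, k)
    errors = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        idx = top_variance_features(X[train], k) if select_inside else global_idx
        # Ordinary least squares (no intercept) on the selected columns.
        coef, *_ = np.linalg.lstsq(X[train][:, idx], y[train], rcond=None)
        pred = X[test][:, idx] @ coef
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))


# Hypothetical small-sample, high-dimensional regression problem.
rng = np.random.default_rng(0)
n, p, k = 40, 200, 5
X = rng.standard_normal((n, p))
beta = 0.1 * rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

mse_outside = cv_mse(X, y, k, n_folds=5, select_inside=False)
mse_inside = cv_mse(X, y, k, n_folds=5, select_inside=True)
print("selection before CV:", mse_outside)
print("selection inside CV:", mse_inside)
```

The two estimates generally differ because, in the first variant, the held‐out rows contribute to the choice of features. A single run such as this one says nothing reliable about the sign or size of the gap; estimating the bias itself would require averaging over many simulated data sets and comparing against the error on fresh data.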
