A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization

Background: In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the entire dataset before training/test-set-based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis (PCA) for dimension reduction of the covariate space. Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.

Methods: We devise the easily interpretable and general measure CVIIM ("CV Incompleteness Impact Measure") to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.

Results: Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.

Conclusions: While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
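
The distinction between incomplete and full CV can be illustrated with a minimal sketch. It is not the authors' implementation. It uses supervised variable selection, the step for which bias is already documented, as a stand-in for the data preparation step, together with a simple nearest-centroid rule on pure-noise data. All function names (`kfold`, `select_features`, `centroid_error`, `cv_error`, `relative_impact`) are hypothetical, and `relative_impact` is only an assumed form of a relative bias measure, since the abstract does not state the CVIIM formula.

```python
import random
from statistics import mean


def kfold(n, k, rng):
    """Split the indices 0..n-1 into k disjoint test folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_features(X, y, m):
    """Stand-in preparation step: keep the m features whose class means differ most."""
    def gap(j):
        a = [x[j] for x, c in zip(X, y) if c == 0]
        b = [x[j] for x, c in zip(X, y) if c == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:m]


def centroid_error(Xtr, ytr, Xte, yte, feats):
    """Misclassification rate of a nearest-centroid rule on the chosen features."""
    cents = {}
    for c in (0, 1):
        rows = [x for x, lab in zip(Xtr, ytr) if lab == c]
        cents[c] = [mean(r[j] for r in rows) for j in feats]

    def predict(x):
        return min(cents, key=lambda c: sum(
            (x[j] - mu) ** 2 for j, mu in zip(feats, cents[c])))

    wrong = sum(predict(x) != lab for x, lab in zip(Xte, yte))
    return wrong / len(yte)


def cv_error(X, y, folds, m, complete):
    """Mean test error over folds; `complete` decides where the step is performed."""
    global_feats = select_features(X, y, m)  # sees all data: used by incomplete CV only
    errs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        Xte, yte = [X[i] for i in test], [y[i] for i in test]
        # Full CV repeats the preparation step on the training part of each split.
        feats = select_features(Xtr, ytr, m) if complete else global_feats
        errs.append(centroid_error(Xtr, ytr, Xte, yte, feats))
    return mean(errs)


def relative_impact(e_incomplete, e_full):
    """Assumed form: relative error reduction under incomplete CV, truncated at 0."""
    if e_full > 0 and e_incomplete < e_full:
        return 1 - e_incomplete / e_full
    return 0.0


rng = random.Random(0)
n, p = 40, 200
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]  # labels carry no signal, so the true error is 0.5
folds = kfold(n, 5, rng)
e_incomplete = cv_error(X, y, folds, m=5, complete=False)
e_full = cv_error(X, y, folds, m=5, complete=True)
print(e_incomplete, e_full, relative_impact(e_incomplete, e_full))
```

Replacing `select_features` with a normalization or PCA step fitted either on all observations or on the training part of each split would give the corresponding comparison for the steps studied here. A single run on one simulated dataset is only illustrative; a stable estimate would require repeating the CV and averaging over many datasets.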
