Tracking Cross-Validated Estimates of Prediction Error as Studies Accumulate

In recent years, “reproducibility” has emerged as a key factor in evaluating applications of statistics to the biomedical sciences, for example, learning predictors of disease phenotypes from high-throughput “omics” data. In particular, “validation” is undermined when error rates on newly acquired data are sharply higher than those originally reported. More precisely, when data are collected from m “studies” representing possibly different subphenotypes or, more generally, different mixtures of subphenotypes, the error rates in cross-study validation (CSV) are observed to be larger than those obtained in ordinary randomized cross-validation (RCV), although the “gap” seems to close as m increases. Although these findings are hardly surprising for a heterogeneous underlying population, the discrepancy is nonetheless seen as a barrier to translational research. We provide a statistical formulation in the large-sample limit: studies themselves are modeled as components of a mixture and all error rates are optimal (Bayes) for a two-class problem. Our results cohere with the trends observed in practice and suggest what is likely to be observed with large samples and consistent density estimators, namely, that the CSV error rate exceeds the RCV error rate for every m, the latter (appropriately averaged) increases with m, and both converge to the optimal rate for the whole population.
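
A minimal sketch of the large-sample quantities involved, in our own notation (equal study weights and minimization over all classifiers are simplifying assumptions made here for illustration, not necessarily the paper's exact formulation): write $P_s$ for the joint law of (features, class label) in study $s$, and $P_{-s}$ for the pooled mixture of the other $m-1$ studies. The limiting error rates can then be expressed as
\[
e_{\mathrm{RCV}}(m) \;=\; \min_{\phi}\, \Pr_{(X,Y)\sim P_m}\!\bigl[\phi(X)\neq Y\bigr],
\qquad P_m \;=\; \frac{1}{m}\sum_{s=1}^{m} P_s,
\]
\[
e_{\mathrm{CSV}}(m) \;=\; \frac{1}{m}\sum_{s=1}^{m} \Pr_{(X,Y)\sim P_s}\!\bigl[\phi^{*}_{-s}(X)\neq Y\bigr],
\qquad \phi^{*}_{-s} \;=\; \arg\min_{\phi}\, \Pr_{(X,Y)\sim P_{-s}}\!\bigl[\phi(X)\neq Y\bigr],
\]
with $\phi$ ranging over all classifiers. In this notation, the trends described above read $e_{\mathrm{CSV}}(m) \ge e_{\mathrm{RCV}}(m)$ for every $m$, $e_{\mathrm{RCV}}(m)$ increasing with $m$, and both quantities converging to the Bayes rate of the whole population as $m$ grows.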
