Bias correction for selecting the minimal-error classifier from many machine learning models

MOTIVATION Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been a common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts. RESULTS In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and Tibshirani-Tibshirani procedure. All methods were compared in simulation datasets, five moderate size real datasets and two large breast cancer datasets. The result showed that IPL outperforms the other methods in bias correction with smaller variance, and it has an additional advantage to extrapolate error estimates for larger sample sizes, a practical feature to recommend whether more samples should be recruited to improve the classifier and accuracy. An R package 'MLbias' and all source files are publicly available. AVAILABILITY AND IMPLEMENTATION tsenglab.biostat.pitt.edu/software.htm. CONTACT ctseng@pitt.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  Edward R. Dougherty,et al.  Reporting bias when using real data sets to analyze classification performance , 2010, Bioinform..

[2]  Anne-Laure Boulesteix,et al.  Correcting the Optimal Resampling‐Based Error Rate by Estimating the Error Rate of Wrapper Algorithms , 2013, Biometrics.

[3]  D. Allison,et al.  Microarray data analysis: from disarray to consolidation and consensus , 2006, Nature Reviews Genetics.

[4]  Anne-Laure Boulesteix,et al.  Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation , 2011 .

[5]  F. Markowetz,et al.  The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups , 2012, Nature.

[6]  Werner Dubitzky,et al.  Avoiding model selection bias in small-sample genomic datasets , 2006, Bioinform..

[7]  Ranjan Das,et al.  Biomedical Research Methodology , 2011 .

[8]  Anne-Laure Boulesteix,et al.  Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction , 2009, BMC medical research methodology.

[9]  Richard Simon,et al.  Bias in error estimation when using cross-validation for model selection , 2006, BMC Bioinformatics.

[10]  A. Dupuy,et al.  Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. , 2007, Journal of the National Cancer Institute.

[11]  Kerrie L. Mengersen,et al.  Classification based upon gene expression data: bias and precision of error rates , 2007, Bioinform..

[12]  R. Tibshirani,et al.  A bias correction for the minimum error rate in cross-validation , 2009, 0908.2904.

[13]  Sayan Mukherjee,et al.  Estimating Dataset Size Requirements for Classifying DNA Microarray Data , 2003, J. Comput. Biol..

[14]  Anne-Laure Boulesteix,et al.  CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data , 2008, BMC Bioinformatics.

[15]  B. Efron Empirical Bayes Estimates for Large-Scale Prediction Problems , 2009, Journal of the American Statistical Association.

[16]  Wenjiang J. Fu,et al.  Estimating misclassification error with small samples via bootstrap cross-validation , 2005, Bioinform..