Adapting Prediction Error Estimates for Biased Complexity Selection in High-Dimensional Bootstrap Samples

The bootstrap allows efficient evaluation of the prediction performance of statistical techniques without having to set aside data for validation. This is especially important for high-dimensional data, e.g., arising from microarrays, because the number of observations is often limited. To avoid overoptimism, the statistical technique to be evaluated has to be applied to every bootstrap sample in the same manner as it would be used on new data. This includes the selection of complexity, e.g., the number of boosting steps for gradient boosting algorithms. Using the latter, we demonstrate in a simulation study that complexity selection in conventional bootstrap samples, drawn with replacement, is severely biased in many scenarios. This translates into a considerable bias of prediction error estimates, often underestimating the amount of information that can be extracted from high-dimensional data. Two potential remedies for this complexity selection bias are investigated: using a fixed level of complexity, and sampling without replacement. The latter is shown to work well in many settings. We focus on high-dimensional binary response data, with bootstrap .632+ estimates of the Brier score for performance evaluation, and on censored time-to-event data, with .632+ prediction error curve estimates. The .632+ prediction error curve estimate, combined with the modified bootstrap procedure, is then applied to an example with microarray data from patients with diffuse large B-cell lymphoma.
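To make the two resampling schemes and the .632+ combination concrete, the following is a minimal Python sketch, not the procedure used in the paper. The function names (`draw_sample`, `brier_score`, `no_information_brier`, `err632plus`) are hypothetical, and the subsample fraction of 0.632 for sampling without replacement is an assumption chosen here for illustration; the abstract does not state a subsample size. The weighting follows the standard .632+ formula, with the apparent error, the out-of-bag error, and a no-information error as inputs.

```python
import numpy as np

def draw_sample(n, rng, replace=True, frac=0.632):
    """Return (in_bag, out_of_bag) index arrays for one resampling step."""
    if replace:
        # Conventional bootstrap: n draws with replacement (duplicates occur).
        in_bag = rng.integers(0, n, size=n)
    else:
        # Subsampling without replacement; frac = 0.632 is an assumed choice.
        m = int(round(frac * n))
        in_bag = rng.choice(n, size=m, replace=False)
    out_of_bag = np.setdiff1d(np.arange(n), in_bag)
    return in_bag, out_of_bag

def brier_score(y, p):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((y - p) ** 2))

def no_information_brier(y, p):
    """Brier score averaged over all pairings (y_i, p_j), i.e. with the
    link between outcome and prediction broken."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((y[:, None] - p[None, :]) ** 2))

def err632plus(apparent, oob, no_info):
    """Combine apparent and out-of-bag error with the .632+ weighting."""
    oob = min(oob, no_info)  # cap the out-of-bag error at the no-information error
    if oob > apparent and no_info > apparent:
        r = (oob - apparent) / (no_info - apparent)  # relative overfitting rate
    else:
        r = 0.0
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * apparent + w * oob
```

In a full evaluation, the model (including its complexity selection) would be refit on each `in_bag` sample and scored on the corresponding `out_of_bag` observations; the averaged out-of-bag Brier score is then passed to `err632plus` together with the apparent and no-information errors from the fit to the full data.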
