The benefit of data-based model complexity selection via prediction error curves in time-to-event data

Fitting predictive survival models usually involves determining model complexity parameters. Until now, there has been no generally applicable model selection criterion for semi- or non-parametric approaches. The integrated prediction error curve, an estimator of the integrated Brier score, can close this gap: it allows a reasonable, data-based choice of complexity parameters for any kind of model from which risk predictions can be obtained. Random survival forests are used as an example throughout the article. Here, a critical complexity parameter might be the number of candidate variables at each node. Model selection by our integrated prediction error curve criterion is compared to a frequently used rule of thumb to investigate the potential benefit in prediction performance. For this purpose, simulated microarray survival data are used, as well as two real data sets, one of patients with diffuse large-B-cell lymphoma and one of patients with neuroblastoma. It is shown that the optimal parameter value depends on the amount of information in the data and that a data-based selection can therefore be beneficial in several settings.
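The selection idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes fully observed event times, so the inverse-probability-of-censoring weights needed for right-censored data are omitted, and it integrates the Brier score over a time grid with the trapezoidal rule. The function names, the toy data, and the two candidate predictors (`"informative"`, `"uninformative"`) are hypothetical stand-ins for models fitted with different complexity parameter values.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A predictor maps a time point t to predicted survival probabilities
# S(t | x_i), one per subject (hypothetical interface for this sketch).
Predictor = Callable[[float], Sequence[float]]


def brier_score(t: float, event_times: Sequence[float],
                surv_probs: Sequence[float]) -> float:
    """Mean squared difference between the observed status I(T_i > t)
    and the predicted survival probability. Assumes no censoring."""
    n = len(event_times)
    return sum(((1.0 if T > t else 0.0) - s) ** 2
               for T, s in zip(event_times, surv_probs)) / n


def integrated_pec(grid: Sequence[float], event_times: Sequence[float],
                   predictor: Predictor) -> float:
    """Trapezoidal integral of the Brier score over the time grid."""
    scores: List[float] = [brier_score(t, event_times, predictor(t))
                           for t in grid]
    return sum((grid[k + 1] - grid[k]) * (scores[k] + scores[k + 1]) / 2.0
               for k in range(len(grid) - 1))


def select_complexity(candidates: Dict[str, Predictor],
                      grid: Sequence[float],
                      event_times: Sequence[float]
                      ) -> Tuple[str, Dict[str, float]]:
    """Return the candidate with the smallest integrated prediction error."""
    ipec = {name: integrated_pec(grid, event_times, f)
            for name, f in candidates.items()}
    return min(ipec, key=ipec.get), ipec


# Toy data (made up for illustration only).
event_times = [1.0, 2.0, 3.0, 4.0]
grid = [0.5, 1.5, 2.5, 3.5]

candidates: Dict[str, Predictor] = {
    # Predicts the true status exactly, so its error is zero.
    "informative": lambda t: [1.0 if T > t else 0.0 for T in event_times],
    # Predicts 0.5 for everyone, so its Brier score is 0.25 at every t.
    "uninformative": lambda t: [0.5] * len(event_times),
}

best, ipec = select_complexity(candidates, grid, event_times)
print(best, ipec)
```

In a realistic setting, each candidate would be a model fitted with a different complexity parameter value, and the error would be estimated on data not used for fitting (for example by resampling) rather than on the training data.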
