Review and evaluation of penalised regression methods for risk prediction in low‐dimensional data with few events

Risk prediction models are used to predict a clinical outcome for patients using a set of predictors. We focus on predicting binary outcomes from low‐dimensional data, as typically arise in epidemiology, health services and public health research, where logistic regression is commonly used. When the number of events is small compared with the number of regression coefficients, model overfitting can be a serious problem. An overfitted model tends to demonstrate poor predictive accuracy when applied to new data. We review frequentist and Bayesian shrinkage methods that may alleviate overfitting by shrinking the regression coefficients towards zero (some methods can also provide more parsimonious models by omitting some predictors). We evaluated their predictive performance in comparison with maximum likelihood estimation using real and simulated data. The simulation study showed that maximum likelihood estimation tends to produce overfitted models with poor predictive performance in scenarios with few events, and that penalised methods can offer improvement. Ridge regression performed well, except in scenarios with many noise predictors. Lasso performed better than ridge in scenarios with many noise predictors and worse in the presence of correlated predictors. Elastic net, a hybrid of the two, performed well in all scenarios. Adaptive lasso and smoothly clipped absolute deviation performed best in scenarios with many noise predictors; in other scenarios, their performance was inferior to that of ridge and lasso. Bayesian approaches performed well when the hyperparameters for the priors were chosen carefully. Their use may aid variable selection, and they can be easily extended to clustered‐data settings and to incorporate external information. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
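The shrinkage effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the simulated data, the true coefficients, the helper `fit_logistic`, and the fixed penalty `ridge_lambda=0.1` are all assumptions made for illustration, whereas in practice the penalty would normally be tuned, for example by cross-validation. The sketch fits a logistic model by plain gradient ascent, once by maximum likelihood and once with a ridge penalty on the slopes, and compares the size of the estimated coefficients.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, ridge_lambda=0.0, steps=1500, lr=0.1):
    """Gradient ascent on the mean log-likelihood minus
    (ridge_lambda / 2) * sum(beta_j ** 2). The intercept is not penalised.
    ridge_lambda = 0 gives (approximately) the maximum likelihood fit."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            resid = yi - sigmoid(b0 + sum(b * x for b, x in zip(beta, xi)))
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        b0 += lr * g0 / n
        beta = [b + lr * (gj / n - ridge_lambda * b) for b, gj in zip(beta, g)]
    return b0, beta


# Hypothetical simulated data: 60 patients, 8 predictors (6 of them noise),
# and a low event rate, so events are few relative to coefficients.
random.seed(1)
true_b0, true_beta = -2.0, [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in true_beta] for _ in range(60)]
y = [
    1 if random.random() < sigmoid(true_b0 + sum(b * x for b, x in zip(true_beta, xi))) else 0
    for xi in X
]

_, beta_mle = fit_logistic(X, y, ridge_lambda=0.0)
_, beta_ridge = fit_logistic(X, y, ridge_lambda=0.1)  # penalty value is arbitrary

norm_mle = math.sqrt(sum(b * b for b in beta_mle))
norm_ridge = math.sqrt(sum(b * b for b in beta_ridge))
print("events:", sum(y))
print("coefficient norm, maximum likelihood:", round(norm_mle, 3))
print("coefficient norm, ridge:", round(norm_ridge, 3))
```

The ridge fit should return slopes that are pulled towards zero relative to the maximum likelihood fit. Swapping the squared penalty for an absolute-value penalty would give the lasso, and a weighted mix of the two would give the elastic net, although those penalties are usually fitted with coordinate descent rather than the simple gradient steps used here.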
