Sample size for binary logistic prediction models: Beyond events per variable criteria

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an events per variable (EPV) criterion, notably EPV ≥ 10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study of how out-of-sample predictive performance depends on EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects. Out-of-sample performance (calibration, discrimination and probability prediction error) of the developed models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance is better approximated by considering the number of predictors, the total sample size and the events fraction. We propose that new sample size criteria for prediction models be based on these three parameters, and we provide suggestions for improving sample size determination.
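To make the EPV arithmetic concrete, the following is a minimal sketch, not taken from the study itself. It assumes EPV is the number of events divided by the number of candidate predictors, and that the events fraction is the proportion of the sample with the (less frequent) outcome. The function names and the example values are hypothetical and chosen only for illustration.

```python
import math
from fractions import Fraction


def events_per_variable(n, events_fraction, n_predictors):
    """EPV under the assumed definition: expected events / candidate predictors."""
    return n * events_fraction / n_predictors


def min_sample_size_for_epv(target_epv, events_fraction, n_predictors):
    """Smallest total sample size n whose expected EPV reaches target_epv.

    The events fraction is converted to an exact fraction first so that
    floating-point rounding cannot push the ceiling up by one.
    """
    fraction = Fraction(events_fraction).limit_denominator(1_000_000)
    return math.ceil(Fraction(target_epv) * n_predictors / fraction)


if __name__ == "__main__":
    # Hypothetical scenario: 8 candidate predictors, EPV >= 10 rule.
    for events_fraction in (0.5, 0.2, 0.05):
        n = min_sample_size_for_epv(10, events_fraction, 8)
        epv = events_per_variable(n, events_fraction, 8)
        print(f"events fraction {events_fraction}: n = {n}, EPV = {epv:.1f}")
```

In this sketch the same EPV corresponds to very different total sample sizes depending on the events fraction, which is one way to see why a single EPV value cannot stand in for the number of predictors, the total sample size and the events fraction considered separately.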
