Paper information - Sample size for binary logistic prediction models: Beyond events per variable criteria

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an events per variable (EPV) criterion, notably EPV ≥ 10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study of how EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects influence the out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of the developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance can be better approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.
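The arithmetic behind an EPV criterion can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `events_per_variable` and `n_total_for_epv` are hypothetical, and the sketch assumes that "events" means the less frequent of the two outcome classes. It shows that a single EPV value is compatible with quite different combinations of total sample size and events fraction, the parameters the abstract proposes to consider directly alongside the number of predictors.

```python
import math


def events_per_variable(n_total: int, events_fraction: float, n_predictors: int) -> float:
    """Return EPV = (expected number of events) / (number of candidate predictors).

    Assumption: "events" refers to the less frequent outcome class, so an
    events fraction above 0.5 is mirrored to 1 - events_fraction.
    """
    if n_total <= 0 or n_predictors <= 0:
        raise ValueError("n_total and n_predictors must be positive")
    if not 0.0 < events_fraction < 1.0:
        raise ValueError("events_fraction must lie strictly between 0 and 1")
    minority_fraction = min(events_fraction, 1.0 - events_fraction)
    return n_total * minority_fraction / n_predictors


def n_total_for_epv(target_epv: float, events_fraction: float, n_predictors: int) -> int:
    """Return the smallest total sample size whose expected EPV reaches target_epv."""
    if target_epv <= 0 or n_predictors <= 0:
        raise ValueError("target_epv and n_predictors must be positive")
    if not 0.0 < events_fraction < 1.0:
        raise ValueError("events_fraction must lie strictly between 0 and 1")
    minority_fraction = min(events_fraction, 1.0 - events_fraction)
    # The small tolerance guards against floating-point noise pushing an
    # exact result up to the next integer.
    return math.ceil(target_epv * n_predictors / minority_fraction - 1e-9)


if __name__ == "__main__":
    # Same rule (EPV >= 10) and same number of predictors (10),
    # but different events fractions give different total sample sizes.
    for fraction in (0.5, 0.25, 0.125):
        n = n_total_for_epv(target_epv=10, events_fraction=fraction, n_predictors=10)
        print(f"events fraction {fraction}: n_total = {n}, "
              f"EPV = {events_per_variable(n, fraction, 10):.1f}")
```

Every scenario in the loop satisfies EPV ≥ 10, yet the total sample size differs by a factor of four, so the EPV value alone does not pin down the three quantities.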
