Adequate sample size for developing prediction models is not simply related to events per variable

Objectives: The sample size for a Cox regression analysis is generally chosen using a rule of thumb, derived from simulation studies, of a minimum of 10 events per variable (EPV). One simulation study suggested scenarios in which the 10 EPV rule can be relaxed. The effect of a range of binary predictors with varying prevalence, reflecting clinical practice, has not yet been fully investigated.

Study Design and Setting: We conducted an extended resampling study using a large general-practice data set, comprising over 2 million anonymized patient records, to examine the EPV requirements for prediction models with low-prevalence binary predictors developed using Cox regression. The performance of the models was then evaluated using an independent external validation data set. We investigated both fully specified models and models derived using variable selection.

Results: Our results indicated that an EPV rule of thumb should be data driven and that EPV ≥ 20 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model.

Conclusion: Higher EPV is needed when low-prevalence predictors are present in a model to eliminate bias in regression coefficients and improve predictive accuracy.
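As a rough illustration of the arithmetic behind the EPV thresholds discussed above, the following minimal Python sketch computes EPV as the number of events divided by the number of candidate predictor parameters, and the number of events needed to reach a target EPV. The function names and the `event_rate` helper are hypothetical and are not taken from the study; the sketch only restates the rule of thumb and does not reproduce the resampling analysis.

```python
import math


def events_per_variable(n_events: int, n_parameters: int) -> float:
    """EPV: number of outcome events per candidate predictor parameter."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


def min_events_for_epv(n_parameters: int, target_epv: float) -> int:
    """Smallest number of events giving at least the target EPV."""
    if n_parameters <= 0 or target_epv <= 0:
        raise ValueError("n_parameters and target_epv must be positive")
    return math.ceil(n_parameters * target_epv)


def min_sample_size(n_parameters: int, target_epv: float, event_rate: float) -> int:
    """Approximate number of patients needed, assuming a known overall
    event rate (an illustrative assumption, not part of the study)."""
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must be in (0, 1]")
    return math.ceil(min_events_for_epv(n_parameters, target_epv) / event_rate)
```

For example, a model with 10 candidate parameters fitted to data containing 200 events has an EPV of 20, which meets the higher threshold reported in the Results; the same model would need only 100 events under the conventional 10 EPV rule.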
