No rationale for 1 variable per 10 events criterion for binary logistic regression analysis

Background: Ten events per variable (EPV) is a widely advocated minimal criterion for sample size considerations in logistic regression analysis. Of three previous simulation studies that examined this minimal EPV criterion, only one supports the use of a minimum of 10 EPV. In this paper, we examine the reasons for the substantial differences between these extensive simulation studies.

Methods: The current study uses Monte Carlo simulations to evaluate small-sample bias, coverage of confidence intervals, and mean square error of logit coefficients. Logistic regression models fitted by maximum likelihood are compared with models fitted by a modified estimation procedure known as Firth’s correction.

Results: The results show that, besides EPV, the problems associated with low EPV depend on other factors such as the total sample size. It is also demonstrated that simulation results can be dominated by even a few simulated data sets for which the prediction of the outcome by the covariates is perfect (‘separation’). We reveal that different approaches for identifying and handling separation lead to substantially different simulation results. We further show that Firth’s correction can be used to improve the accuracy of regression coefficients and alleviate the problems associated with separation.

Conclusions: The current evidence supporting EPV rules for binary logistic regression is weak. Given our findings, there is an urgent need for new research to provide guidance for supporting sample size considerations for binary logistic regression analysis.
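To make the role of separation and Firth’s correction concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' simulation code. The function name `firth_logistic`, its parameters, and the toy data are hypothetical and chosen only for illustration. The sketch assumes the commonly described form of Firth’s correction for logistic regression, in which the score is modified to X'(y - p + h(1/2 - p)), with h the diagonal of the weighted hat matrix, and it uses plain Fisher scoring without step-halving.

```python
import numpy as np


def firth_logistic(X, y, max_iter=200, tol=1e-10):
    """Firth-type penalized logistic regression (illustrative sketch).

    X must already contain an intercept column if one is wanted.
    Returns the vector of penalized coefficient estimates.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities
        w = p * (1.0 - p)                          # binomial variance weights
        info_inv = np.linalg.inv(X.T @ (X * w[:, None]))
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Modified score: the usual score plus Firth's bias-reducing term
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy data with complete separation (assumed for illustration):
# x = 0 gives 0 events out of 4, x = 1 gives 3 events out of 3.
x = np.array([0, 0, 0, 0, 1, 1, 1])
y = np.array([0, 0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

beta = firth_logistic(X, y)
print(beta)
```

In this toy data set the ordinary maximum likelihood estimate of the slope does not exist, because the likelihood keeps increasing as the slope grows. The penalized fit stays finite. For a single binary covariate, the penalized estimates coincide with adding 0.5 to each cell of the 2x2 table, so the intercept is log(0.5/4.5) and the slope is log(63), which gives a simple check of the sketch.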
