Too many covariates and too few cases? – a comparative study

Prior research indicates that 10-15 cases or controls, whichever is fewer, are required per parameter to reliably estimate regression coefficients in multivariable logistic regression models. This condition may be difficult to meet even in a well-designed study when the number of potential confounders is large, the outcome is rare, and/or interactions are of interest. Various propensity score approaches have been implemented when the exposure is binary. Recent work on shrinkage approaches such as the lasso was motivated by the critical need to develop methods for the p >> n situation, where p is the number of parameters and n is the sample size. Those methods, however, have been used less frequently when p ≈ n, and in this situation there is no guidance on choosing among regular logistic regression models, propensity score methods, and shrinkage approaches. To fill this gap, we conducted extensive simulations mimicking our motivating clinical data from a study estimating vaccine effectiveness for preventing influenza hospitalizations in the 2011-2012 influenza season. Ridge regression and penalized logistic regression models that penalize all but the coefficient of the exposure may be considered in these types of studies. Copyright © 2016 John Wiley & Sons, Ltd.
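
As a rough illustration of the 10-15 cases-or-controls-per-parameter rule stated above, the following minimal sketch computes the ratio for a planned model. The function names, the default threshold of 10, and the counts in the usage lines are hypothetical and are not taken from the study.

```python
def events_per_parameter(n_cases, n_controls, n_parameters):
    """Return min(cases, controls) divided by the number of model parameters."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    # The rule counts whichever group is smaller, cases or controls.
    return min(n_cases, n_controls) / n_parameters


def meets_rule(n_cases, n_controls, n_parameters, threshold=10.0):
    """Check the ratio against a chosen threshold (10 is the lower end of 10-15)."""
    return events_per_parameter(n_cases, n_controls, n_parameters) >= threshold


# Hypothetical study: 100 cases, 900 controls, 25 parameters
# (exposure, confounders, and interaction terms all count as parameters).
ratio = events_per_parameter(100, 900, 25)
print(ratio, meets_rule(100, 900, 25))
```

With these made-up counts the ratio is 100 / 25 = 4, well below 10, which is the kind of setting where the abstract suggests considering ridge regression or a penalized model that leaves the exposure coefficient unpenalized.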
