Variable Selection for Confounding Adjustment in High-dimensional Covariate Spaces When Analyzing Healthcare Databases

Background: Data-adaptive approaches to confounding adjustment may improve performance beyond expert knowledge when analyzing electronic healthcare databases, and they have additional practical advantages for analyzing multiple databases in rapid cycles. Improvements seemed possible if outcome predictors were reliably identified empirically and adjusted for.

Methods: In five cohort studies from diverse healthcare databases, we implemented a base-case high-dimensional propensity score (hdPS) algorithm with propensity score decile-adjusted outcome models to estimate treatment effects among prescription drug initiators. The original variable selection procedure, which prioritizes each variable by its estimated bias using unadjusted associations between confounders and exposure (RR_CE) and disease outcome (RR_CD), was augmented by alternative strategies. These included using increasingly adjusted RR_CD estimates, including models considering >1,500 variables jointly (Lasso, Bayesian logistic regression); using prediction statistics or likelihood-ratio statistics for covariate prioritization; directly estimating the propensity score with >1,500 variables (Lasso, Bayesian regression); or directly fitting an outcome model using all covariates jointly (Lasso, Ridge).

Results: In the five example studies, most tested augmentations of the base-case hdPS did not meaningfully change estimates in light of wide confidence intervals. The exceptions were Bayesian regression and Lasso used to estimate RR_CD, which moved estimates minimally closer to the expectation in three of five examples. Direct outcome estimation with Lasso performed worst.

Conclusion: Overall, the basic heuristic of variable reduction in high-dimensional propensity score adjustment performed as well as alternative approaches in diverse settings. Minor improvements in variable selection may be possible using Bayesian outcome regression to prioritize variables for propensity score estimation when outcomes are rare.
See video abstract at http://links.lww.com/EDE/B162.
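The base-case prioritization described under Methods ranks each binary covariate by an estimated bias term built from its imbalance across exposure groups and its unadjusted association with the outcome. A minimal sketch of one common form of that term (a Bross-style bias multiplier) follows. The function name, argument names, and the ranking by absolute log bias are illustrative assumptions rather than the authors' code, and published hdPS implementations may differ in details such as the handling of RR_CD below 1 or of zero cells.

```python
import numpy as np

def rank_by_bross_bias(covariates, exposure, outcome, top_k=1):
    """Rank binary covariates by a Bross-style bias multiplier (illustrative sketch).

    Assumes every covariate takes both values 0 and 1, both exposure groups are
    non-empty, and outcome events occur at both levels of each covariate
    (no zero cells).
    """
    X = np.asarray(covariates, dtype=float)   # shape (n_patients, n_covariates)
    e = np.asarray(exposure, dtype=bool)      # shape (n_patients,)
    y = np.asarray(outcome, dtype=float)      # shape (n_patients,)

    # Covariate prevalence among the exposed (p_c1) and the unexposed (p_c0).
    p_c1 = X[e].mean(axis=0)
    p_c0 = X[~e].mean(axis=0)

    # Unadjusted covariate-outcome risk ratio RR_CD:
    # outcome risk where the covariate is present divided by risk where it is absent.
    risk_present = (X * y[:, None]).sum(axis=0) / X.sum(axis=0)
    risk_absent = ((1.0 - X) * y[:, None]).sum(axis=0) / (1.0 - X).sum(axis=0)
    rr_cd = risk_present / risk_absent

    # Bias multiplier: [p_c1 (RR_CD - 1) + 1] / [p_c0 (RR_CD - 1) + 1].
    bias = (p_c1 * (rr_cd - 1.0) + 1.0) / (p_c0 * (rr_cd - 1.0) + 1.0)

    # Prioritize covariates whose multiplier is farthest from 1 on the log scale.
    order = np.argsort(-np.abs(np.log(bias)))[:top_k]
    return order, bias
```

A covariate that is balanced across exposure groups receives a multiplier of 1 regardless of how strongly it predicts the outcome, so it is ranked low. The augmentations compared in the abstract keep this ranking step but replace the unadjusted RR_CD with increasingly adjusted estimates.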
