A Targeted Approach to Confounder Selection for High-Dimensional Data

We consider the problem of selecting confounders for adjustment from a potentially large set of covariates when estimating a causal effect. Recently, the high-dimensional propensity score (hdPS) method was developed for this task; hdPS ranks potential confounders by estimating an importance score for each variable and selects the top few variables. However, this ranking procedure is limited: it requires all variables to be binary. We propose an extension of hdPS to general types of response and confounder variables. We further develop a group importance score, which allows us to rank groups of potential confounders. The main challenge is that our parameter requires either the propensity score or the response model, both of which are vulnerable to model misspecification. We propose a targeted maximum likelihood estimator (TMLE) that allows the use of nonparametric machine learning tools for fitting these intermediate models. We establish asymptotic normality of our estimator, which in turn allows the construction of confidence intervals. We complement our work with numerical studies on simulated and real data.

Keywords: Causal inference, Confounder selection, High-dimensional data, Targeted maximum likelihood estimation, High-dimensional propensity score
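For readers unfamiliar with the ranking step that hdPS performs on binary data, the following is a minimal sketch and not the estimator proposed in this paper. It assumes the importance score is the absolute log of a Bross-type bias multiplier, as commonly described in the hdPS literature, computed from a binary treatment, a binary outcome, and binary covariates. The function names `bross_scores` and `top_k` and the smoothing constant `eps` are hypothetical choices made for illustration.

```python
import numpy as np


def bross_scores(X, a, y, eps=1e-12):
    """Illustrative importance score for each binary covariate (column of X).

    X : (n, p) array of 0/1 covariates
    a : (n,) array of 0/1 treatment indicators
    y : (n,) array of 0/1 outcomes
    eps is a hypothetical smoothing constant that guards against empty cells.
    """
    X = np.asarray(X, dtype=float)
    a = np.asarray(a).astype(bool)
    y = np.asarray(y, dtype=float)

    # Prevalence of each covariate among treated and untreated units.
    p_c1 = X[a].mean(axis=0)
    p_c0 = X[~a].mean(axis=0)

    # Outcome risk when the covariate is present vs. absent.
    n_present = X.sum(axis=0)
    n_absent = (1.0 - X).sum(axis=0)
    risk_present = (X * y[:, None]).sum(axis=0) / np.maximum(n_present, eps)
    risk_absent = ((1.0 - X) * y[:, None]).sum(axis=0) / np.maximum(n_absent, eps)
    rr = (risk_present + eps) / (risk_absent + eps)

    # Bross-type bias multiplier; covariates are ranked by |log(multiplier)|.
    bias = (p_c1 * (rr - 1.0) + 1.0) / (p_c0 * (rr - 1.0) + 1.0)
    return np.abs(np.log(bias))


def top_k(scores, k):
    """Indices of the k covariates with the largest scores."""
    return np.argsort(-np.asarray(scores))[:k]
```

The sketch makes the limitation noted in the abstract concrete: both the prevalence terms and the relative risk are defined through 0/1 indicators, so the score has no direct analogue for continuous responses or covariates.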
