A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure

We often seek to estimate the impact of an exposure naturally occurring or randomly assigned at the cluster level. For example, the literature on neighborhood determinants of health continues to grow. Likewise, community randomized trials are used to learn about real-world implementation, sustainability, and population effects of interventions with proven individual-level efficacy. In these settings, individual-level outcomes are correlated due to shared cluster-level factors, including the exposure, as well as social or biological interactions between individuals. To flexibly and efficiently estimate the effect of a cluster-level exposure, we present two targeted maximum likelihood estimators (TMLEs). The first TMLE is developed under a non-parametric causal model, which allows for arbitrary interactions between individuals within a cluster. These interactions include direct transmission of the outcome (i.e., contagion) and influence of one individual’s covariates on another’s outcome (i.e., covariate interference). The second TMLE is developed under a causal sub-model assuming the cluster-level and individual-specific covariates are sufficient to control for confounding. Simulations compare the alternative estimators and illustrate the potential gains from pairing individual-level risk factors and outcomes during estimation, while avoiding unwarranted assumptions. Our results suggest that estimation under the sub-model can result in bias and misleading inference in an observational setting. Incorporating working assumptions during estimation is more robust than assuming they hold in the underlying causal model. We illustrate our approach with an application to HIV prevention and treatment.
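To make the general idea concrete, the following is a minimal sketch of a TMLE for the average effect of a binary cluster-level exposure, applied to data that have already been aggregated to the cluster level. It is not the authors' implementation. It assumes main-terms logistic regressions in place of any data-adaptive learner, a cluster-level outcome defined as the simple within-cluster mean, and a simulated data set. All function names, variable names, and simulation parameters are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def fit_logistic(X, y, offset=None, n_iter=50):
    """Newton-Raphson fit of a (quasi-)binomial logistic regression.
    y may be fractional in [0, 1]; offset is a fixed linear-predictor term."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = expit(X @ beta + offset)
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None]) + 1e-8 * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def cluster_level_tmle(W, A, Y):
    """W: (J, p) cluster-level covariates and summaries; A: (J,) binary exposure;
    Y: (J,) cluster-level outcome in [0, 1]. Returns (estimate, std. error, influence curve)."""
    J = len(A)
    ones, zeros = np.ones(J), np.zeros(J)
    bound = lambda q: np.clip(q, 1e-6, 1 - 1e-6)

    # Step 1: initial outcome regression E[Y | A, W] (assumed main-terms logistic model).
    bQ = fit_logistic(np.column_stack([ones, A, W]), Y)
    Q1 = bound(expit(np.column_stack([ones, ones, W]) @ bQ))
    Q0 = bound(expit(np.column_stack([ones, zeros, W]) @ bQ))
    QA = np.where(A == 1, Q1, Q0)

    # Step 2: exposure mechanism P(A = 1 | W), bounded away from 0 and 1.
    Xg = np.column_stack([ones, W])
    g1 = np.clip(expit(Xg @ fit_logistic(Xg, A)), 0.025, 0.975)

    # Step 3: targeting. Fluctuate the initial fit along the clever covariate.
    H1, H0 = 1.0 / g1, -1.0 / (1.0 - g1)
    HA = np.where(A == 1, H1, H0)
    eps = fit_logistic(HA[:, None], Y, offset=logit(QA))[0]
    Q1s = expit(logit(Q1) + eps * H1)
    Q0s = expit(logit(Q0) + eps * H0)
    QAs = np.where(A == 1, Q1s, Q0s)

    # Step 4: plug-in estimate and influence-curve-based standard error,
    # treating the cluster as the independent unit.
    psi = np.mean(Q1s - Q0s)
    ic = HA * (Y - QAs) + (Q1s - Q0s) - psi
    se = np.sqrt(np.var(ic, ddof=1) / J)
    return psi, se, ic

# Hypothetical simulation: J clusters of N individuals each.
rng = np.random.default_rng(0)
J, N = 200, 50
E = rng.normal(size=J)                               # cluster-level covariate
A = rng.binomial(1, expit(0.5 * E))                  # exposure depends on E
Wij = rng.normal(loc=0.5 * E[:, None], size=(J, N))  # individual-level covariate
pij = expit(-0.5 + 0.4 * A[:, None] + 0.3 * Wij + 0.3 * E[:, None])
Yij = rng.binomial(1, pij)                           # individual-level binary outcome

Yc = Yij.mean(axis=1)                                # cluster-level outcome
Wc = np.column_stack([E, Wij.mean(axis=1)])          # cluster covariate plus a summary
psi, se, ic = cluster_level_tmle(Wc, A, Yc)
print(f"estimate: {psi:.3f}, 95% CI: ({psi - 1.96 * se:.3f}, {psi + 1.96 * se:.3f})")
```

Because every step operates on one row per cluster, the sketch makes no assumption about how individuals within a cluster interact. The price is that individual-level covariates enter only through cluster-level summaries such as the mean.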
