A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure

We often seek to estimate the impact of an exposure that occurs naturally or is randomly assigned at the cluster level. For example, the literature on neighborhood determinants of health continues to grow. Likewise, community randomized trials are used to learn about the real-world implementation, sustainability, and population effects of interventions with proven individual-level efficacy. In these settings, individual-level outcomes are correlated due to shared cluster-level factors, including the exposure, as well as social or biological interactions between individuals. To flexibly and efficiently estimate the effect of a cluster-level exposure, we present two targeted maximum likelihood estimators (TMLEs). The first TMLE is developed under a non-parametric causal model, which allows for arbitrary interactions between individuals within a cluster. These interactions include direct transmission of the outcome (i.e., contagion) and the influence of one individual’s covariates on another’s outcome (i.e., covariate interference). The second TMLE is developed under a causal sub-model assuming that the cluster-level and individual-specific covariates are sufficient to control for confounding. Simulations compare the alternative estimators and illustrate the potential gains from pairing individual-level risk factors and outcomes during estimation, while avoiding unwarranted assumptions. Our results suggest that estimation under the sub-model can result in bias and misleading inference in an observational setting. Incorporating working assumptions during estimation is more robust than assuming they hold in the underlying causal model. We illustrate our approach with an application to HIV prevention and treatment.
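To make the setting concrete, the following is a minimal sketch of a cluster-level analysis under a hypothetical data-generating process. It is not either of the TMLEs described above: it is the simple unadjusted difference in cluster means, shown only to illustrate why the cluster, rather than the individual, is treated as the independent unit when a shared cluster-level factor correlates outcomes. The function names, the effect size of 1.0, the linear outcome model, and the sample sizes are all assumptions made for illustration.

```python
import math
import random
import statistics


def simulate_cluster(rng, exposure, n_individuals=50):
    """Simulate individual outcomes within one cluster (hypothetical model).

    A shared cluster-level factor enters every individual's outcome,
    so outcomes within the same cluster are correlated.
    """
    cluster_factor = rng.gauss(0.0, 1.0)  # shared by all individuals in the cluster
    outcomes = []
    for _ in range(n_individuals):
        individual_covariate = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Assumed true effect of the cluster-level exposure is 1.0.
        outcomes.append(
            1.0 * exposure + cluster_factor + 0.5 * individual_covariate + noise
        )
    return outcomes


def cluster_level_difference(n_clusters=200, seed=0):
    """Unadjusted effect estimate using the cluster as the unit of analysis."""
    rng = random.Random(seed)
    cluster_means = {0: [], 1: []}
    for _ in range(n_clusters):
        exposure = 1 if rng.random() < 0.5 else 0  # randomized at the cluster level
        outcomes = simulate_cluster(rng, exposure)
        # Aggregate individual outcomes to a single cluster-level outcome.
        cluster_means[exposure].append(statistics.mean(outcomes))

    estimate = statistics.mean(cluster_means[1]) - statistics.mean(cluster_means[0])
    # Standard error based on between-cluster variability only.
    std_error = math.sqrt(
        statistics.variance(cluster_means[1]) / len(cluster_means[1])
        + statistics.variance(cluster_means[0]) / len(cluster_means[0])
    )
    return estimate, std_error


if __name__ == "__main__":
    estimate, std_error = cluster_level_difference()
    print(f"estimate={estimate:.3f}, standard error={std_error:.3f}")
```

In this sketch the individual-level covariate is discarded when outcomes are aggregated. Using such covariates during estimation is where the abstract locates the potential gains.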
