Combining Ensemble Learning Techniques and G-Computation to Investigate Chemical Mixtures in Environmental Epidemiology Studies

Background: Although biomonitoring studies demonstrate that the general population is exposed to multiple chemicals, most environmental epidemiology studies consider each chemical separately when assessing the adverse effects of environmental exposures. There is therefore a critical need for novel approaches that can handle multiple correlated exposures.

Methods: We propose a novel approach that combines the G-formula, a maximum likelihood-based substitution estimator, with an ensemble learning technique (SuperLearner) to estimate causal effects for a multi-pollutant mixture. Using real data on five correlated exposures, we simulated four continuous outcomes under four exposure-response relationships of increasing complexity, with 500 replications. The first outcome was generated as a linear function of two exposures; the second was based on a univariate nonlinear exposure-response relationship; the third was generated as a linear function of two exposures and their interaction; the fourth was based on a nonlinear exposure-response relationship with effect modification by sex, plus a linear relationship with a second exposure. We assessed the method on its predictive performance (mean squared error [MSE]), its ability to detect the true predictors and interactions (false discovery proportion and sensitivity), and its bias. We compared the method with generalized linear models, generalized additive models, elastic net, random forests, and extreme gradient boosting. Finally, we reconstructed the exposure-response relationships and developed a toolbox for visualizing interactions using individual conditional expectations.

Results: The proposed method yielded the best average MSE across all scenarios, indicating that it adapted to the true underlying structure of the data. The method detected the true predictors and interactions and was less biased than the comparison methods in all scenarios. Finally, it correctly reconstructed the exposure-response relationships in all simulations.

Conclusions: This is the first approach combining ensemble learning techniques and causal inference to unravel the effects of chemical mixtures and their interactions in epidemiological studies. Future work will extend the approach to high-dimensional exposure data and test its ability to detect low-to-moderate associations.
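The following is a minimal sketch of the general idea of pairing an ensemble learner with G-computation, not the authors' implementation. It assumes scikit-learn's `StackingRegressor` as a stand-in for SuperLearner, and the simulated data, learner library, function names (`simulate`, `fit_ensemble`, `g_computation_contrast`), and effect sizes are all hypothetical. The simulated outcome loosely mirrors the first scenario (a linear function of two of five correlated exposures).

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import ElasticNetCV, LinearRegression


def simulate(n=1000, seed=0):
    """Hypothetical data: five correlated exposures, outcome linear in two."""
    rng = np.random.default_rng(seed)
    # Unit variances with pairwise correlation 0.5 (an assumption).
    cov = np.full((5, 5), 0.5) + 0.5 * np.eye(5)
    X = rng.multivariate_normal(np.zeros(5), cov, size=n)
    y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)
    return X, y


def fit_ensemble(X, y, seed=0):
    """Cross-validated stacking, used here as a stand-in for SuperLearner."""
    learners = [
        ("lm", LinearRegression()),
        ("enet", ElasticNetCV(cv=5)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=seed)),
        ("gbm", GradientBoostingRegressor(random_state=seed)),
    ]
    # Non-negative meta-learner weights fitted on out-of-fold predictions.
    meta = LinearRegression(positive=True)
    stack = StackingRegressor(estimators=learners, final_estimator=meta, cv=5)
    return stack.fit(X, y)


def g_computation_contrast(model, X, exposure, low, high):
    """Substitution estimate of the mean outcome difference when one
    exposure is set to `high` versus `low` for everyone, leaving the
    other exposures at their observed values."""
    X_low, X_high = X.copy(), X.copy()
    X_low[:, exposure] = low
    X_high[:, exposure] = high
    return float(np.mean(model.predict(X_high) - model.predict(X_low)))


if __name__ == "__main__":
    X, y = simulate()
    model = fit_ensemble(X, y)
    for j in range(X.shape[1]):
        effect = g_computation_contrast(model, X, j, low=0.0, high=1.0)
        print(f"exposure {j}: estimated contrast = {effect:.2f}")
```

In this sketch the contrast for each exposure is obtained by predicting under two hypothetical exposure levels and averaging the difference over the observed distribution of the remaining exposures. A bootstrap over the whole procedure would be one way to obtain interval estimates, and the causal interpretation rests on the usual identifiability assumptions, which the code does not check.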
