A Machine-Learning Algorithm for Estimating and Ranking the Impact of Environmental Risk Factors in Exploratory Epidemiological Studies

Epidemiological research, such as the identification of disease risks attributable to environmental chemical exposures, is often hampered by small population effects, large measurement error, and limited a priori knowledge of the complex relationships among the many chemicals under study. Even an ideal study design, however, cannot rule out reported false-positive exposure effects that arise from inappropriate statistical methodology. Three issues are often overlooked: (1) defining a meaningful measure of association; (2) using model estimation strategies (such as machine learning) that acknowledge that the true data-generating model is unknown; and (3) accounting for multiple testing. In this paper, we propose an algorithm designed to address each of these issues in turn by combining recent advances in the causal inference and multiple-testing literature with modifications to traditional nonparametric inference methods.
