Classification and regression trees for epidemiologic research: an air pollution example

Background: Identifying and characterizing how mixtures of exposures are associated with health endpoints is challenging. We demonstrate how classification and regression trees can be used to generate hypotheses regarding joint effects from exposure mixtures.

Methods: We illustrate the approach by investigating the joint effects of CO, NO2, O3, and PM2.5 on emergency department visits for pediatric asthma in Atlanta, Georgia. Pollutant concentrations were categorized as quartiles. Days when all pollutants were in the lowest quartile were held out as the referent group (n = 131), and the remaining 3,879 days were used to estimate the regression tree. Pollutants were parameterized as dichotomous variables representing each ordinal split of the quartiles (e.g., CO quartile 1 vs. CO quartiles 2–4) and considered one at a time in a Poisson case-crossover model with control for confounding. The pollutant split with the smallest P-value was selected as the first split, and the dataset was partitioned accordingly. This process was repeated within each subset of the data until no remaining split had a P-value below a given alpha, at which point the subset became a “terminal node”. We used the case-crossover model to estimate the adjusted risk ratio for each terminal node compared to the referent group, and applied a likelihood ratio test for the inclusion of the terminal nodes in the final model.

Results: The largest risk ratio corresponded to days when PM2.5 was in the highest quartile and NO2 was in the lowest two quartiles (RR: 1.10, 95% CI: 1.05, 1.16). A simultaneous Wald test for the inclusion of all terminal nodes in the model was significant, with a chi-square statistic of 34.3 (13 degrees of freedom, p = 0.001).

Conclusions: Regression trees can be used to hypothesize about joint effects of exposure mixtures and may be particularly useful in air pollution epidemiology for gaining a better understanding of complex multipollutant exposures.
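The split-selection procedure described in the Methods can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the authors' implementation: the data are simulated, the function names (`simulate_days`, `split_pvalue`, `grow`) are hypothetical, alpha is set to 0.05 arbitrarily, and a simple two-group Poisson likelihood ratio test stands in for the adjusted Poisson case-crossover model, so the ratios printed at the end are crude and unadjusted for confounding.

```python
import math
import random

POLLUTANTS = ("CO", "NO2", "O3", "PM25")


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_days(n_days=4000, seed=1):
    """Toy data (assumption): independent quartiles, 30% more visits on PM25 Q4 days."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        day = {p: rng.randint(1, 4) for p in POLLUTANTS}
        day["visits"] = poisson_draw(rng, 13.0 if day["PM25"] == 4 else 10.0)
        days.append(day)
    return days


def max_loglik(total, n):
    """Poisson log-likelihood at the MLE rate total/n, dropping constant terms."""
    return total * math.log(total / n) - total if total > 0 else 0.0


def split_pvalue(low, high):
    """LRT p-value (1 df) for equal mean visits on the two sides of a split.

    Stand-in for the paper's adjusted case-crossover model (assumption).
    """
    s1, s2 = sum(low), sum(high)
    stat = 2.0 * (max_loglik(s1, len(low)) + max_loglik(s2, len(high))
                  - max_loglik(s1 + s2, len(low) + len(high)))
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))  # chi-square(1) upper tail


def grow(days, alpha=0.05, path=()):
    """Return terminal nodes as (path, days) pairs via recursive partitioning."""
    best = None
    for pollutant in POLLUTANTS:
        for cut in (1, 2, 3):  # Q1 vs Q2-4, Q1-2 vs Q3-4, Q1-3 vs Q4
            low = [d for d in days if d[pollutant] <= cut]
            high = [d for d in days if d[pollutant] > cut]
            if not low or not high:
                continue  # this split cannot partition the current subset
            p = split_pvalue([d["visits"] for d in low],
                             [d["visits"] for d in high])
            if best is None or p < best[0]:
                best = (p, pollutant, cut, low, high)
    if best is None or best[0] >= alpha:
        return [(path, days)]  # no split below alpha: terminal node
    _, pollutant, cut, low, high = best
    return (grow(low, alpha, path + (f"{pollutant}<=Q{cut}",))
            + grow(high, alpha, path + (f"{pollutant}>Q{cut}",)))


def mean_visits(days):
    """Average daily visit count in a group of days."""
    return sum(d["visits"] for d in days) / len(days)


all_days = simulate_days()
# Hold out days with every pollutant in the lowest quartile as the referent group.
referent = [d for d in all_days if all(d[p] == 1 for p in POLLUTANTS)]
remaining = [d for d in all_days if not all(d[p] == 1 for p in POLLUTANTS)]
ref_mean = mean_visits(referent) if referent else float("nan")
leaves = grow(remaining)
for path, node in leaves:
    ratio = mean_visits(node) / ref_mean  # crude ratio, not an adjusted RR
    print(" & ".join(path), f"n={len(node)}", f"crude ratio={ratio:.2f}")
```

Because many candidate splits are tested at every level without any correction, a sketch like this will tend to produce some spurious splits on noise. That is consistent with treating the resulting terminal nodes as hypotheses about joint effects rather than as confirmed findings.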
