A comparison of statistical and machine learning methods for creating national daily maps of ambient PM2.5 concentration.

A typical challenge in air pollution epidemiology is to perform detailed exposure assessment for individuals for whom health data are available. To address this problem, substantial research effort in recent years has gone into developing statistical methods and machine learning techniques that generate estimates of air pollution at fine spatial and temporal scales (usually daily) with complete coverage. However, it is not clear how much the exposures predicted by the various methods differ, or which method generates the most reliable estimates. In this paper, we aim to address this gap by evaluating a variety of exposure modeling approaches and comparing their predictive performance. Using PM2.5 in 2011 over the continental U.S. as a case study, we generate national maps of ambient PM2.5 concentration using: (i) ordinary least squares and inverse distance weighting; (ii) kriging; (iii) statistical downscaling models, that is, spatial statistical models that use the information contained in air quality model outputs; (iv) land use regression, that is, linear regression modeling approaches that leverage the information in Geographical Information System (GIS) covariates; and (v) machine learning methods, such as neural networks, random forests and support vector regression. We examine the predictive performance of the various methods via cross-validation using Root Mean Squared Error, Mean Absolute Deviation, Pearson correlation, and Mean Spatial Pearson Correlation. Additionally, we evaluate whether factors such as season, urbanicity, and level of PM2.5 concentration (low, medium or high) affect the performance of the different methods. Overall, statistical methods that explicitly model spatial correlation, e.g., universal kriging and the downscaler model, outperform all the other exposure assessment approaches regardless of season, urbanicity and PM2.5 concentration level.
We posit that spatial statistical models predict better than machine learning methods because they explicitly account for spatial dependence and thus borrow information from neighboring observations. In light of our findings, we suggest that future exposure assessment methods for regional PM2.5 either incorporate information from neighboring sites when deriving predictions at unsampled locations or otherwise attempt to account for spatial dependence.
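
As an illustration of the evaluation scheme described above, the following is a minimal sketch of k-fold cross-validation for one of the simplest methods compared, inverse distance weighting, scored with Root Mean Squared Error, Mean Absolute Deviation and Pearson correlation. It is not the authors' implementation. The function names, the distance power of 2, the random (non-spatial) fold assignment, the planar Euclidean distance, and the reading of Mean Absolute Deviation as the mean absolute prediction error are all assumptions made for the example. Mean Spatial Pearson Correlation is omitted because its definition is not given here.

```python
import numpy as np

def idw_predict(train_xy, train_y, query_xy, power=2.0, eps=1e-12):
    """Inverse distance weighting: each prediction is a weighted mean of the
    observed values, with weights proportional to 1 / distance**power."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    d = np.linalg.norm(query_xy[:, None, :] - train_xy[None, :, :], axis=2)
    # Clamp tiny distances so a query at a monitor location does not divide by zero.
    w = 1.0 / np.maximum(d, eps) ** power
    return (w @ train_y) / w.sum(axis=1)

def cv_metrics(obs, pred):
    """RMSE, mean absolute error (assumed meaning of MAD) and Pearson correlation."""
    err = pred - obs
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mad": float(np.mean(np.abs(err))),
        "pearson": float(np.corrcoef(obs, pred)[0, 1]),
    }

def kfold_idw(xy, y, k=5, seed=0):
    """Hold out each fold of monitors in turn, predict it from the rest,
    and score the pooled out-of-sample predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    pred = np.empty(len(y), dtype=float)
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        pred[held_out] = idw_predict(xy[train], y[train], xy[held_out])
    return cv_metrics(y, pred)

if __name__ == "__main__":
    # Synthetic stand-in for monitor locations and one day of concentrations.
    rng = np.random.default_rng(1)
    xy = rng.uniform(0.0, 10.0, size=(200, 2))
    y = 8.0 + np.sin(xy[:, 0]) + 0.5 * xy[:, 1] + rng.normal(0.0, 0.3, size=200)
    print(kfold_idw(xy, y))
```

Because inverse distance weighting predicts only from neighboring observations, it makes a convenient baseline for the point made above: methods that borrow information from nearby sites can be compared against covariate-only models under the same held-out folds.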
