A comparison of linear regression, regularization, and machine learning algorithms to develop Europe-wide spatial models of fine particles and nitrogen dioxide.

Empirical spatial air pollution models have been applied extensively to assess exposure in epidemiological studies, using increasingly sophisticated and complex statistical algorithms beyond ordinary linear regression. However, different algorithms have rarely been compared in terms of their predictive ability. This study compared 16 algorithms to predict annual average fine particle (PM2.5) and nitrogen dioxide (NO2) concentrations across Europe. The evaluated algorithms included linear stepwise regression, regularization techniques and machine learning methods. Air pollution models were developed based on the 2010 routine monitoring data from the AIRBASE dataset maintained by the European Environment Agency (543 sites for PM2.5 and 2399 sites for NO2), using satellite observations, dispersion model estimates and land use variables as predictors. We compared the models by performing five-fold cross-validation (CV) and by external validation (EV) using annual average concentrations measured at 416 (PM2.5) and 1396 (NO2) sites from the ESCAPE study. We further assessed the correlations between the predictions of each pair of algorithms at the ESCAPE sites. For PM2.5, the models performed similarly across algorithms, with a mean CV R2 of 0.59 and a mean EV R2 of 0.53. Generalized boosted machine, random forest and bagging performed best (CV R2 ~0.63; EV R2 0.58-0.61), while backward stepwise linear regression, support vector regression and artificial neural network performed less well (CV R2 0.48-0.57; EV R2 0.39-0.46). Most of the PM2.5 model predictions at ESCAPE sites were highly correlated (R2 > 0.85, with the exception of predictions from the artificial neural network). For NO2, the models performed even more similarly across different algorithms, with CV R2s ranging from 0.57 to 0.62 and EV R2s ranging from 0.49 to 0.51. The predicted concentrations from all algorithms at ESCAPE sites were highly correlated (R2 > 0.9). For both pollutants, biases were low for all models except the artificial neural network. In all algorithms that allow the importance of variables to be assessed, dispersion model estimates and satellite observations were two of the most important predictors for PM2.5 models, whilst dispersion model estimates and traffic variables were the most important for NO2 models. Different statistical algorithms performed similarly when modelling spatial variation in annual average air pollution concentrations using a large number of training sites.
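The evaluation design described above (five-fold CV R2 per algorithm, then the squared correlation between the predictions of a pair of algorithms) can be sketched in a few lines. This is only an illustration under stated assumptions, not the study's code: the data are synthetic, the single predictor is a hypothetical stand-in for something like a dispersion model estimate, and the two algorithms (ordinary least squares and a k-nearest-neighbour average) stand in for the 16 algorithms actually compared. All function and variable names are made up for this example.

```python
import random


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def squared_correlation(a, b):
    """Squared Pearson correlation between two prediction vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov * cov / (var_a * var_b)


def fit_linear(x, y):
    """One-predictor ordinary least squares; returns a predict function."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
             / sum((u - mean_x) ** 2 for u in x))
    intercept = mean_y - slope * mean_x
    return lambda new_x: [intercept + slope * v for v in new_x]


def fit_knn(x, y, n_neighbors=10):
    """k-nearest-neighbour regression; returns a predict function."""
    pairs = list(zip(x, y))

    def predict(new_x):
        out = []
        for v in new_x:
            nearest = sorted(pairs, key=lambda p: abs(p[0] - v))[:n_neighbors]
            out.append(sum(p[1] for p in nearest) / len(nearest))
        return out

    return predict


def cross_validate(fit, x, y, k=5, seed=0):
    """k-fold CV: returns (R2 on pooled held-out predictions, predictions)."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    held_out = [0.0] * len(x)
    for fold in folds:
        test = set(fold)
        train = [i for i in indices if i not in test]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        for i, p in zip(fold, predict([x[i] for i in fold])):
            held_out[i] = p
    return r_squared(y, held_out), held_out


# Synthetic "monitoring sites": concentration depends linearly on one
# hypothetical predictor plus noise (assumed values, not study data).
rng = random.Random(42)
predictor = [rng.uniform(0.0, 30.0) for _ in range(200)]
concentration = [5.0 + 0.6 * v + rng.gauss(0.0, 3.0) for v in predictor]

r2_linear, pred_linear = cross_validate(fit_linear, predictor, concentration)
r2_knn, pred_knn = cross_validate(fit_knn, predictor, concentration)
pairwise_r2 = squared_correlation(pred_linear, pred_knn)

print(f"CV R2 linear: {r2_linear:.2f}")
print(f"CV R2 kNN:    {r2_knn:.2f}")
print(f"R2 between the two sets of predictions: {pairwise_r2:.2f}")
```

Both algorithms are validated on the same folds (same seed), so their held-out predictions are directly comparable site by site. External validation would follow the same pattern, with each model fitted on all training sites and scored on a separate set of measurements.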
