Imputing Satellite-Derived Aerosol Optical Depth Using a Multi-Resolution Spatial Model and Random Forest for PM2.5 Prediction

A key task in environmental health research is producing complete pollution exposure maps from limited monitoring data. Satellite-derived aerosol optical depth (AOD) is frequently used as a predictor to improve PM2.5 estimation, even though AOD itself has significant gaps in coverage. We analyze PM2.5 and AOD over the contiguous United States for July 2011. We examine two methods for gap-filling AOD: (1) lattice kriging, a spatial statistical method adapted to handle large amounts of data, and (2) random forest, a tree-based machine learning method. First, we evaluate each model’s performance in the spatial prediction of AOD, and we additionally consider ensemble methods for combining the two predictors. To assess predictive performance accurately, we construct spatially clustered holdouts that mimic the observed patterns of missing data. Finally, we assess whether gap-filling AOD with one of the proposed ensemble methods improves prediction of PM2.5 in a random forest model. Our results suggest that ensembles combining lattice kriging and random forest can improve AOD gap-filling. Based on summary performance metrics, PM2.5 predictions from random forest models were largely similar with or without gap-filled AOD, although daily model predictions showed some variability.
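
The two evaluation ideas in the abstract, spatially clustered holdouts and an ensemble of two predictors, can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the circular holdout shape, the radius, and the single convex weight are all assumptions chosen for illustration, and the two prediction vectors stand in for lattice kriging and random forest output.

```python
import numpy as np


def clustered_holdout_mask(coords, n_clusters, radius, rng):
    """Boolean mask of points within `radius` of randomly chosen centers.

    Hypothetical helper: holding out whole discs (rather than scattered
    points) is one simple way to imitate contiguous gaps in coverage.
    """
    centers = coords[rng.choice(len(coords), size=n_clusters, replace=False)]
    # Distances from every point to every center: shape (n_points, n_clusters).
    dists = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
    return (dists <= radius).any(axis=1)


def convex_ensemble_weight(y, pred_a, pred_b):
    """Weight w in [0, 1] minimizing squared error of w*pred_a + (1-w)*pred_b.

    The objective is a one-dimensional convex quadratic in w, so the
    constrained minimizer is the unconstrained one clipped to [0, 1].
    """
    diff = pred_a - pred_b
    denom = np.dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical predictors: any weight gives the same result
    w = np.dot(y - pred_b, diff) / denom
    return float(np.clip(w, 0.0, 1.0))


# Synthetic 20 x 20 grid standing in for AOD pixel locations (assumption).
rng = np.random.default_rng(0)
xs, ys = np.meshgrid(np.arange(20.0), np.arange(20.0))
coords = np.column_stack([xs.ravel(), ys.ravel()])
holdout = clustered_holdout_mask(coords, n_clusters=3, radius=2.5, rng=rng)

# Stand-in predictions on the held-out pixels (purely synthetic values).
pred_a = rng.normal(size=holdout.sum())
pred_b = rng.normal(size=holdout.sum())
truth = 0.3 * pred_a + 0.7 * pred_b
weight = convex_ensemble_weight(truth, pred_a, pred_b)
ensemble = weight * pred_a + (1.0 - weight) * pred_b
```

In practice the weight would be estimated on one set of clustered holdouts and the ensemble scored on another, so that the combination step is not evaluated on the data used to tune it.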
