Evaluation of random forest regression and multiple linear regression for predicting indoor fine particulate matter concentrations in a highly polluted city.

BACKGROUND Indoor and outdoor fine particulate matter (PM2.5) are both leading risk factors for death and disease, but measuring indoor concentrations is often infeasible for large study populations.

METHODS We developed models to predict indoor PM2.5 concentrations for pregnant women who were part of a randomized controlled trial of portable air cleaners in Ulaanbaatar, Mongolia. We used multiple linear regression (MLR) and random forest regression (RFR) to model indoor PM2.5 concentrations with 447 independent 7-day PM2.5 measurements and 87 potential predictor variables obtained from outdoor monitoring data, questionnaires, home assessments, and geographic data sets. We also developed blended models that combined the MLR and RFR approaches. All models were evaluated using 10-fold cross-validation.

RESULTS The predictors in the MLR model were season, outdoor PM2.5 concentration, the number of air cleaners deployed, and the density of gers (traditional felt-lined yurts) surrounding the apartments. MLR and RFR performed similarly in cross-validation (R² = 50.2% and 48.9%, respectively). The blended MLR model that included RFR predictions had the best performance (cross-validation R² = 81.5%). Intervention status alone explained only 6.0% of the variation in indoor PM2.5 concentrations.

CONCLUSIONS We predicted a moderate amount of variation in indoor PM2.5 concentrations using easily obtained predictor variables, and the models explained substantially more variation than intervention status alone. While RFR shows promise for modelling indoor concentrations, our results highlight the importance of out-of-sample validation when evaluating model performance. We also demonstrate the improved performance of blended MLR/RFR models in predicting indoor air pollution.
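
The emphasis on out-of-sample validation can be illustrated with a minimal sketch. It is not the authors' code or data: the measurements are synthetic, a one-predictor least-squares line stands in for MLR, and a 1-nearest-neighbour regressor is used as a dependency-free stand-in for a flexible learner such as RFR. All function and variable names are hypothetical. The sketch computes R² twice for each model, once on the data used for fitting and once from pooled 10-fold cross-validation predictions.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def predict_line(model, x):
    a, b = model
    return a + b * x


def fit_nn(xs, ys):
    """1-nearest-neighbour 'fit' simply memorises the training pairs."""
    return list(zip(xs, ys))


def predict_nn(model, x):
    # Return the outcome attached to the closest memorised predictor value.
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


def r_squared(y_true, y_pred):
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def in_sample_r2(xs, ys, fit, predict):
    """R² when the model is scored on the same data it was fit to."""
    model = fit(xs, ys)
    return r_squared(ys, [predict(model, x) for x in xs])


def cross_validated_r2(xs, ys, fit, predict, k=10, seed=0):
    """R² from pooled predictions, each made by a model fit without that fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(xs)
    for fold in range(k):
        held_out = set(idx[fold::k])
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = predict(model, xs[i])
    return r_squared(ys, preds)


# Synthetic data (assumption): indoor PM2.5 as a noisy linear function of outdoor PM2.5.
rng = random.Random(42)
n = 447  # matches the study's measurement count; the values themselves are made up
outdoor = [rng.uniform(20.0, 200.0) for _ in range(n)]
indoor = [10.0 + 0.4 * x + rng.gauss(0.0, 15.0) for x in outdoor]

line_in = in_sample_r2(outdoor, indoor, fit_line, predict_line)
line_cv = cross_validated_r2(outdoor, indoor, fit_line, predict_line)
nn_in = in_sample_r2(outdoor, indoor, fit_nn, predict_nn)
nn_cv = cross_validated_r2(outdoor, indoor, fit_nn, predict_nn)

print(f"linear:  in-sample R2 = {line_in:.3f}, 10-fold CV R2 = {line_cv:.3f}")
print(f"1-NN:    in-sample R2 = {nn_in:.3f}, 10-fold CV R2 = {nn_cv:.3f}")
```

The memorising model reproduces its training data perfectly, so its in-sample R² says nothing about how well it predicts unseen homes. Only the cross-validated figure gives a fair basis for comparing a flexible learner with a linear model.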
