Regression trees modeling of time series for air pollution analysis and forecasting

Solving the problems related to air pollution is crucial for human health and ecosystems in many urban areas throughout the world. The accumulation of large sets of measurements of various air pollutants makes it possible to analyze these data in order to predict and control pollution. This study presents a general approach for building high-quality nonlinear models of environmental time series using the data mining technique of classification and regression trees (CART). The predictors are time series with meteorological, atmospheric or other data, date-time variables, and lagged variables of the dependent variable and of the predictors, entered as groups. The proposed approach is tested in empirical studies of the daily average concentrations of atmospheric PM10 (particulate matter up to 10 μm in diameter) in the cities of Ruse and Pernik, Bulgaria. One-day-ahead forecasts are obtained. All models are cross-validated to guard against overfitting. The best models are selected using goodness-of-fit measures, such as root-mean-square error and coefficient of determination. The relative importance of the predictors and predictor groups is obtained and interpreted. The CART models are compared with the corresponding models built using ARIMA transfer function methodology, and the superiority of CART over ARIMA is demonstrated. The practical applicability of the models is assessed using 2 × 2 contingency tables. The results show that the CART models fit the data well and correctly predict about 90% of the measured PM10 values with respect to the average daily European threshold value of 50 µg/m³.
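The evaluation steps named above (lagged predictors, root-mean-square error, coefficient of determination, and a 2 × 2 contingency table against the 50 µg/m³ threshold) can be illustrated with a minimal Python sketch. It is not the study's implementation: the PM10 values are made up, the function names are hypothetical, and a simple persistence forecast (yesterday's value) stands in for the predictions of a fitted CART model.

```python
import math

THRESHOLD = 50.0  # average daily European PM10 threshold in µg/m³ (from the text)


def make_lagged(series, lags):
    """Build lagged predictor rows; row for day t holds series[t - lag] per lag."""
    max_lag = max(lags)
    rows = [[series[t - lag] for lag in lags] for t in range(max_lag, len(series))]
    targets = series[max_lag:]
    return rows, targets


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def contingency_table(observed, predicted, threshold=THRESHOLD):
    """Count the four cells of a 2 x 2 table for threshold exceedance."""
    table = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_negative": 0}
    for o, p in zip(observed, predicted):
        if o > threshold and p > threshold:
            table["hit"] += 1
        elif o > threshold:
            table["miss"] += 1
        elif p > threshold:
            table["false_alarm"] += 1
        else:
            table["correct_negative"] += 1
    return table


# Made-up daily PM10 averages, for illustration only (not data from the study).
pm10 = [38.0, 42.0, 55.0, 61.0, 47.0, 52.0, 49.0, 35.0]

rows, observed = make_lagged(pm10, lags=[1, 2])

# Assumption: a persistence forecast (lag-1 value) replaces the CART model here.
predicted = [row[0] for row in rows]

table = contingency_table(observed, predicted)
correct_share = (table["hit"] + table["correct_negative"]) / len(observed)

print("RMSE:", rmse(observed, predicted))
print("R^2:", r_squared(observed, predicted))
print("2x2 table:", table)
print("Share correct w.r.t. threshold:", correct_share)
```

In practice the `predicted` list would come from a cross-validated regression tree fitted on the lagged rows together with the meteorological and date-time predictors; the share of correct threshold classifications is the quantity the abstract reports as about 90%.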
