Feature and Language Selection in Temporal Symbolic Regression for Interpretable Air Quality Modelling

Air quality modelling that relates meteorological, car traffic, and pollution data is a fundamental problem that the recent literature has approached in several different ways. In particular, a set of such data sampled at a specific location over a specific period of time can be seen as a multivariate time series, and modelling the pollutant concentrations can be seen as a multivariate temporal regression problem. In this paper, we propose a new method for symbolic multivariate temporal regression and apply it to several data sets containing real air quality data from the city of Wrocław (Poland). Our experiments show that our approach outperforms classical approaches, symbolic ones in particular, in both statistical performance and the interpretability of the results.
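
To make the temporal regression framing concrete, the following is a minimal sketch of the classical lagged-variable formulation, in which the pollutant value at time t is regressed on current and past values of the predictors. It is not the method proposed in the paper: the helper names (`make_lagged`, `fit_linear`), the synthetic data, the two unnamed predictors, and the choice of ordinary least squares are all assumptions made for illustration only.

```python
import numpy as np


def make_lagged(X, y, max_lag):
    """Build a lagged design matrix (hypothetical helper, for illustration).

    X is a (T, d) array of predictors and y a (T,) target series.
    The row for time t holds X[t], X[t-1], ..., X[t-max_lag] side by side,
    so the first max_lag time points are dropped.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    T = X.shape[0]
    # Block k contains the predictors delayed by k time steps.
    blocks = [X[max_lag - k: T - k] for k in range(max_lag + 1)]
    return np.hstack(blocks), y[max_lag:]


def fit_linear(Z, y):
    """Ordinary least squares with an intercept; returns [intercept, weights...]."""
    A = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


# Synthetic stand-in for real measurements (assumption, not the Wrocław data):
# two predictors, and a target that depends on predictor 0 at lag 0
# and on predictor 1 at lag 2.
rng = np.random.default_rng(0)
T = 200
X = rng.normal(size=(T, 2))
y = np.zeros(T)
y[2:] = 3.0 + 2.0 * X[2:, 0] - 1.5 * X[:-2, 1]

Z, y_aligned = make_lagged(X, y, max_lag=2)
coef = fit_linear(Z, y_aligned)

# Columns of Z are ordered lag 0 (both predictors), lag 1, lag 2, so after the
# intercept, coef[1] is predictor 0 at lag 0 and coef[6] is predictor 1 at lag 2.
print(np.round(coef, 3))
```

In this formulation every lagged copy of every variable becomes a candidate column, which is why selecting a small subset of them matters for interpretability.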
