Handling high-dimensional data in air pollution forecasting tasks

This paper proposes methods for handling high-dimensional weather forecast data used to predict the concentrations of PM10, PM2.5, SO2, NO, CO and O3. Pollution prediction normally requires historical data samples for a large number of points in time, particularly weather forecast data, actual weather data and pollution data. It also typically involves numerous features related to atmospheric conditions. Consequently, analysing such datasets to generate accurate forecasts becomes a very cumbersome task. The paper examines a variety of unsupervised dimensionality reduction methods aimed at obtaining a compact yet informative set of features. As an alternative, an approach using fractional distances for data analysis tasks is also considered. Both strategies were evaluated on real-world data obtained from the Institute of Meteorology and Water Management in Katowice (Poland), with the extended Air Pollution Forecast Model (e-APFM) used as the underlying prediction tool. Employing a fractional distance as the dissimilarity measure was found to give the best forecasting accuracy. Satisfactory results can also be obtained with Isomap, Landmark Isomap and Factor Analysis as dimensionality reduction techniques. These methods can also be used to formulate a universal mapping, ready to use for data gathered in different geographical areas.
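
To make the fractional-distance idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes the common Minkowski-type form d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p) with 0 < p < 1. The abstract does not state the exact definition or the exponent used with e-APFM, so the default p = 0.5 and the names fractional_distance and nearest_index are hypothetical.

```python
import numpy as np


def fractional_distance(x, y, p=0.5):
    """Minkowski-type dissimilarity with a fractional exponent 0 < p < 1.

    Assumed form: (sum_i |x_i - y_i|**p) ** (1 / p).
    For p < 1 this is not a true metric (the triangle inequality fails),
    but it can still serve as a dissimilarity measure.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))


def nearest_index(query, history, p=0.5):
    """Index of the historical sample least dissimilar to the query."""
    distances = [fractional_distance(query, sample, p) for sample in history]
    return int(np.argmin(distances))


# Hypothetical feature vectors (e.g. scaled weather-forecast attributes).
history = [[0.0, 0.0], [3.0, 3.0], [1.0, 0.0]]
query = [1.0, 1.0]
best = nearest_index(query, history, p=0.5)
```

In this sketch the query would be a vector of forecast features and history the stored past samples. How the selected sample is then used for prediction depends on the forecasting model and is not shown here.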
