Generalised linear model-based algorithm for detection of outliers in environmental data and comparison with semi-parametric outlier detection methods

Abstract Outliers are often present in large datasets of air pollutant concentrations. Existing methods for detecting outliers in environmental data fall into three groups, depending on the character of the data: methods for time series, methods for time series measured simultaneously with accompanying variables, and methods for spatial data. Many methods proposed for the automatic detection of outliers in time series are limited by the assumption that the distribution of the analysed variable is known. Because environmental variables are often influenced by accompanying factors, their distribution is difficult to estimate. Using the available information about accompanying variables, together with methods designed for time series measured simultaneously with such variables, can significantly improve outlier detection. This paper presents a method for the automatic detection of outliers in PM10 aerosols measured simultaneously with accompanying variables. The method is based on a generalised linear model and subsequent analysis of its residuals, and thus exploits the additional information carried by the accompanying variables. The results of the proposed procedure are compared with those of two distribution-free outlier detection methods for time series previously proposed by the authors. A simulation-based comparison of all three procedures showed that the procedure presented in this paper effectively detects outliers that deviate by at least 5 standard deviations from the mean of the neighbouring observations, and that it outperforms both distribution-free methods.
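
The general idea of fitting a generalised linear model on accompanying variables and then screening its residuals can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not state the model family, link function, residual type or decision rule, so the Gamma family with log link, the Pearson residuals, the robust (median and MAD) scaling and the `threshold` value below are all assumptions chosen for illustration, and the function names `fit_gamma_log_glm` and `flag_outliers` are hypothetical.

```python
import numpy as np


def fit_gamma_log_glm(X, y, n_iter=50, tol=1e-8):
    """Fit a Gamma GLM with log link by iteratively reweighted least squares.

    Assumption: Gamma family and log link (not stated in the abstract).
    For this pairing the IRLS working weights are all equal to 1, so each
    iteration reduces to ordinary least squares on the working response.
    X must already contain an intercept column; y must be strictly positive.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Start from the least-squares fit to log(y).
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ beta                  # linear predictor
        mu = np.exp(eta)                # fitted mean under the log link
        z = eta + (y - mu) / mu         # working response
        new_beta = np.linalg.lstsq(X, z, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


def flag_outliers(X, y, threshold=5.0):
    """Flag observations whose Pearson residual is far from the bulk.

    Assumption: residuals are centred by their median and scaled by the
    normal-consistent MAD; `threshold` is an illustrative cut-off, not a
    value taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = fit_gamma_log_glm(X, y)
    mu = np.exp(X @ beta)
    resid = (y - mu) / mu               # Pearson residual for V(mu) = mu**2
    centre = np.median(resid)
    scale = 1.4826 * np.median(np.abs(resid - centre))
    return np.abs(resid - centre) > threshold * scale
```

Here `X` would hold an intercept column plus accompanying variables (for example meteorological covariates) and `y` the measured concentrations. The median and MAD are used instead of the mean and standard deviation so that the outliers being searched for do not inflate the scale estimate.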
