Semiparametric outlier detection in nonstationary time series: Case study for atmospheric pollution in Brno, Czech Republic

Abstract Large environmental datasets usually include outliers, which can have significant effects on further analysis and modelling. Various outlier detection methods exist, but they depend on the distribution of the analysed variable, and quite often the distribution of environmental variables cannot be estimated. This paper presents an approach for identifying outliers in environmental time series that does not impose restrictions on the distribution of the observed variables. The suggested algorithm combines kernel smoothing with extreme value estimation techniques for stochastic processes, while accounting for a nonstationary expected value of the process. Nonstationarity in variance is handled by a change point analysis that precedes the proposed algorithm. Possible outliers are identified as observations with rare occurrence and, in accordance with extreme value methodology, confidence limits for high values of the observed variables are constructed. The proposed methodology can be especially convenient when the data have to be validated manually, since it significantly reduces the number of implausible observations to be checked. As a case study, the technique is applied to outlier detection in a time series of hourly PM10 concentrations in Brno, Czech Republic. The methodology is based on solid theoretical results and seems to perform well for the PM10 series. Its flexibility also makes it applicable beyond series of atmospheric pollutants. On the other hand, the choice of return level turns out to be crucial for the sensitivity to outliers. This choice should be left to practitioners, who can decide with respect to the specific application conditions.
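The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel with a fixed bandwidth, the method-of-moments fit of a generalized Pareto distribution to residual exceedances, and all function names and default parameters (kernel_smooth, gpd_return_level, flag_outliers, bandwidth, threshold_quantile, return_period) are assumptions made for illustration only. The sketch also assumes the variance is already stationary (that is, the change point step has been done) and ignores serial dependence between exceedances.

```python
import math
from statistics import mean, pvariance


def kernel_smooth(y, bandwidth):
    """Nadaraya-Watson estimate of the mean level, Gaussian kernel on the time index."""
    n = len(y)
    half = int(4 * bandwidth)  # truncate the kernel at 4 bandwidths
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        w = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(lo, hi)]
        trend.append(sum(wj * y[j] for wj, j in zip(w, range(lo, hi))) / sum(w))
    return trend


def gpd_return_level(resid, threshold_quantile, return_period):
    """Level exceeded on average once per `return_period` observations.

    Uses a method-of-moments generalized Pareto fit to the exceedances
    over an empirical quantile threshold (an assumed, simple estimator).
    """
    s = sorted(resid)
    u = s[int(threshold_quantile * len(s))]   # threshold
    exc = [r - u for r in resid if r > u]     # exceedances over u
    m, v = mean(exc), pvariance(exc)
    xi = 0.5 * (1 - m * m / v)                # shape estimate
    sigma = 0.5 * m * (m * m / v + 1)         # scale estimate
    zeta = len(exc) / len(resid)              # exceedance probability
    if abs(xi) < 1e-9:                        # exponential-tail limit
        return u + sigma * math.log(return_period * zeta)
    return u + sigma / xi * ((return_period * zeta) ** xi - 1)


def flag_outliers(y, bandwidth=12, threshold_quantile=0.95, return_period=5000):
    """Indices whose residual from the smoothed mean exceeds the return level."""
    trend = kernel_smooth(y, bandwidth)
    resid = [a - b for a, b in zip(y, trend)]
    level = gpd_return_level(resid, threshold_quantile, return_period)
    return [i for i, r in enumerate(resid) if r > level]
```

Calling flag_outliers(series) on a list of hourly values returns candidate indices for manual validation. A larger return_period raises the return level and flags fewer points, which is the sensitivity trade-off the abstract leaves to the practitioner.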
