Time series clustering by a robust autoregressive metric with application to air pollution

Abstract We propose a robust fuzzy clustering model for classifying time series, based on an autoregressive parameterization. Following a fuzzy partitioning around medoids approach, the model identifies a medoid time series for each cluster, that is, a time series representative of the cluster, together with the membership degree of each time series in each cluster. Robustness comes from adopting a robust metric for time series, the so-called exponential distance measure, which lets the model tolerate outlier time series in the clustering process. Specifically, the model neutralizes and smooths the disruptive effect of outlier time series and preserves the original clustering structure of the dataset by assigning outliers approximately the same membership degree in every cluster. A simulation study and an application to air pollution time series illustrate the usefulness and effectiveness of the model. Comparison with existing clustering procedures from the literature shows several advantages of the proposed model.
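
The abstract does not state the formulas, so the following is only a minimal sketch of how the robust ingredients could fit together, not the authors' implementation. It assumes that each series is summarized by least-squares AR(p) coefficients, that the exponential distance takes the form 1 - exp(-beta * d^2) applied to a squared distance d^2 between coefficient vectors, and that memberships follow the usual fuzzy c-medoids update with fuzziness parameter m. The names fit_ar, exponential_distance, fuzzy_memberships, beta, m and eps are hypothetical and chosen for illustration.

```python
import numpy as np


def fit_ar(x, p):
    """Least-squares AR(p) coefficients of a mean-centred 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Column k holds the lag-(k+1) values x[t-k-1] for t = p, ..., n-1.
    lags = np.column_stack([x[p - k - 1 : n - k - 1] for k in range(p)])
    coef, *_ = np.linalg.lstsq(lags, x[p:], rcond=None)
    return coef


def exponential_distance(sq_dist, beta):
    """Assumed robust transform 1 - exp(-beta * d^2); it never exceeds 1."""
    return 1.0 - np.exp(-beta * np.asarray(sq_dist, dtype=float))


def fuzzy_memberships(sq_dist_to_medoids, beta, m=2.0, eps=1e-12):
    """Membership degrees, shape (n, C), from squared distances to C medoids."""
    # eps guards against division by zero when a series is itself a medoid.
    d = exponential_distance(sq_dist_to_medoids, beta) + eps
    w = d ** (-1.0 / (m - 1.0))
    # Normalize so that each row (one series) sums to 1 across clusters.
    return w / w.sum(axis=1, keepdims=True)
```

Under these assumptions, a series far from every medoid has transformed distances that all saturate near 1, so its memberships come out approximately equal across clusters, which matches the behaviour the abstract describes for outliers. A series close to one medoid keeps a membership near 1 in that cluster.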
