Clustering time series from ARMA models with clipped data

Clustering time series is a problem with applications in a wide variety of fields, and it has recently attracted a large amount of research. In this paper we focus on clustering data derived from Autoregressive Moving Average (ARMA) models using the k-means and k-medoids algorithms with the Euclidean distance between estimated model parameters. We justify our choice of clustering technique and distance metric by reproducing results obtained in related research. Our aim is to assess the effect of discretising data into binary sequences of values above and below the median, a process known as clipping, on the clustering of time series. It is known that the fitted AR parameters of clipped data tend asymptotically to the parameters of the unclipped data. We exploit this result to demonstrate that, for long series, the clustering accuracy achieved with clipped data from the class of ARMA models is not significantly different from that achieved with unclipped data. Next we show that if the data contain outliers then using clipped data produces significantly better clusterings. We then demonstrate that clipped series require much less memory and that operations such as distance calculations can be performed much faster. Finally, we demonstrate these advantages on three real-world data sets.
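
The pipeline described above (clip each series at its median, fit an autoregressive model to the clipped sequence, then cluster the fitted coefficients with k-means under Euclidean distance) can be sketched in a few lines. The code below is a minimal illustration rather than the authors' implementation: it assumes statsmodels and scikit-learn are available, the AR order, series lengths and toy AR(1) coefficients are arbitrary choices, and the helper names clip_series and ar_features are hypothetical.

```python
# Minimal sketch (not the paper's code): clip at the median, fit an AR model
# to the clipped 0/1 sequence, and cluster the fitted coefficients with
# k-means using Euclidean distance. AR order and toy data are illustrative.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg
from sklearn.cluster import KMeans

def clip_series(x):
    """Binarise a series: 1 where the value is above its median, else 0."""
    return (x > np.median(x)).astype(float)

def ar_features(x, order=2):
    """Fitted AR(order) coefficients (intercept dropped) used as features."""
    return AutoReg(x, lags=order).fit().params[1:]

# Toy data: two groups of AR(1) series with different coefficients.
rng = np.random.default_rng(0)

def simulate_ar1(phi, n=500):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

series = [simulate_ar1(0.7) for _ in range(10)] + \
         [simulate_ar1(-0.5) for _ in range(10)]

# Feature matrix of AR coefficients fitted to the clipped series.
features = np.array([ar_features(clip_series(x)) for x in series])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```

In practice the clipped series could also be packed into bit vectors, which is where the memory and distance-computation savings mentioned in the abstract would come from; the sketch above only illustrates the clustering step itself.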
