Distance measures for effective clustering of ARIMA time-series

Much environmental and socioeconomic time-series data can be adequately modeled using autoregressive integrated moving average (ARIMA) models. We call such time series "ARIMA time series". We propose the use of the linear predictive coding (LPC) cepstrum for clustering ARIMA time series, taking the Euclidean distance between the LPC cepstra of two time series as their dissimilarity measure. We demonstrate that LPC cepstral coefficients have the features desired for accurate clustering and efficient indexing of ARIMA time series. For example, just a few LPC cepstral coefficients suffice to discriminate between time series that are modeled by different ARIMA models. In fact, this approach requires fewer coefficients than traditional approaches, such as the DFT (discrete Fourier transform) and the DWT (discrete wavelet transform). The proposed distance measure can also be used to measure the similarity between different ARIMA models. We cluster ARIMA time series using the "partition around medoids" method with various similarity measures. We present experimental results demonstrating that the proposed measure yields significantly better clusterings of ARIMA time series data than those obtained with other traditional similarity measures, such as those based on the DFT, the DWT, and PCA (principal component analysis). Experiments were performed on both simulated and real data.
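As an illustration of the proposed dissimilarity measure, the following minimal sketch computes LPC cepstral coefficients from the autoregressive coefficients of a fitted model using the standard LPC-to-cepstrum recursion, and then takes the Euclidean distance between two truncated cepstra. It rests on several assumptions that are not stated in the abstract: the AR coefficients a_1, ..., a_p follow the sign convention x_t = a_1 x_(t-1) + ... + a_p x_(t-p) + e_t, they have already been estimated by some other means (for an ARIMA series, typically after differencing), and the function names `lpc_cepstrum` and `cepstral_distance` as well as the default of five coefficients are hypothetical choices made for this example.

```python
import math


def lpc_cepstrum(ar_coeffs, n_coeffs):
    """Return the first n_coeffs LPC cepstral coefficients c_1..c_n.

    Assumes the AR convention x_t = sum_i a_i * x_(t-i) + e_t, so that
    ln(1 / (1 - sum_i a_i z^-i)) = sum_n c_n z^-n.
    """
    p = len(ar_coeffs)
    c = []
    for n in range(1, n_coeffs + 1):
        # Direct term a_n exists only while n does not exceed the AR order p.
        value = ar_coeffs[n - 1] if n <= p else 0.0
        # Recursive term: sum over m of (m / n) * c_m * a_(n-m),
        # restricted to indices where a_(n-m) is defined (n - m <= p).
        for m in range(max(1, n - p), n):
            value += (m / n) * c[m - 1] * ar_coeffs[n - m - 1]
        c.append(value)
    return c


def cepstral_distance(ar_a, ar_b, n_coeffs=5):
    """Euclidean distance between the truncated LPC cepstra of two AR models."""
    ca = lpc_cepstrum(ar_a, n_coeffs)
    cb = lpc_cepstrum(ar_b, n_coeffs)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


if __name__ == "__main__":
    # Two hypothetical AR models: AR(1) with a_1 = 0.5 and AR(2) with (0.5, -0.3).
    print(lpc_cepstrum([0.5], 3))
    print(cepstral_distance([0.5], [0.5, -0.3]))
```

For an AR(1) model with coefficient a_1, the recursion reduces to c_n = a_1^n / n, which gives a quick sanity check. A pairwise matrix of such distances could then be passed to a medoid-based clustering routine of the reader's choice.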
