A dimensionality reduction technique for efficient time series similarity analysis

We propose a dimensionality reduction technique for time series analysis that significantly improves the efficiency and accuracy of similarity searches. In contrast to piecewise constant approximation (PCA) techniques that approximate each time series with constant value segments, the proposed method--Piecewise Vector Quantized Approximation--uses the closest (based on a distance measure) codeword from a codebook of key-sequences to represent each segment. The new representation is symbolic and it allows for the application of text-based retrieval techniques into time series similarity analysis. Experiments on real and simulated datasets show that the proposed technique generally outperforms PCA techniques in clustering and similarity searches.

[1]  Magnus Lie Hetland A Survey of Recent Methods for Efficient Retrieval of Similar Time Sequences , 2001 .

[2]  R. J. Alcock,et al.  Time-Series Similarity Queries Employing a Feature-Based Approach , 1999 .

[3]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[4]  Piotr Indyk,et al.  Mining the stock market (extended abstract): which measure is best? , 2000, KDD '00.

[5]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[6]  Eamonn J. Keogh,et al.  Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases , 2001, Knowledge and Information Systems.

[7]  Eamonn J. Keogh,et al.  A symbolic representation of time series, with implications for streaming algorithms , 2003, DMKD '03.

[8]  Michel Verleysen,et al.  Vector quantization: a weighted version for time-series forecasting , 2005, Future Gener. Comput. Syst..

[9]  Dina Q. Goldin,et al.  On Similarity Queries for Time-Series Data: Constraint Specification and Implementation , 1995, CP.

[10]  Eamonn J. Keogh,et al.  Locally adaptive dimensionality reduction for indexing large time series databases , 2001, SIGMOD '01.

[11]  Lars Niklasson,et al.  Time series segmentation using an adaptive resource allocating vector quantization network based on change detection , 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium.

[12]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[13]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[14]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[15]  Christos Faloutsos,et al.  A signature technique for similarity-based queries , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[16]  Dimitrios Gunopulos,et al.  Discovering similar multidimensional trajectories , 2002, Proceedings 18th International Conference on Data Engineering.

[17]  Wesley W. Chu,et al.  Efficient searches for similar subsequences of different lengths in sequence databases , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[18]  James E. Fowler Adaptive vector quantization-Part I: a unifying structure , 1997, Proceedings DCC '97. Data Compression Conference.

[19]  Kyuseok Shim,et al.  Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases , 1995, VLDB.

[20]  Qiang Wang,et al.  A multiresolution symbolic representation of time series , 2005, 21st International Conference on Data Engineering (ICDE'05).

[21]  Hannu Toivonen,et al.  Mining for similarities in aligned time series using wavelets , 1999, Defense, Security, and Sensing.

[22]  K Lehnertz,et al.  Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: dependence on recording region and brain state. , 2001, Physical review. E, Statistical, nonlinear, and soft matter physics.

[23]  Christos Faloutsos,et al.  Fast Time Sequence Indexing for Arbitrary Lp Norms , 2000, VLDB.

[24]  Haixun Wang,et al.  Landmarks: a new model for similarity-based pattern querying in time series databases , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[25]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[26]  Gaston H. Gonnet,et al.  A fast algorithm on average for all-against-all sequence matching , 1999, 6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268).

[27]  Eamonn J. Keogh,et al.  An Enhanced Representation of Time Series Which Allows Fast and Accurate Classification, Clustering and Relevance Feedback , 1998, KDD.

[28]  Qiang Wang,et al.  A dimensionality reduction technique for efficient similarity analysis of time series databases , 2004, CIKM '04.

[29]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[30]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[31]  Abraham Kandel,et al.  Data Mining in Time Series Database , 2004 .

[32]  Christos Faloutsos,et al.  Efficient and effective Querying by Image Content , 1994, Journal of Intelligent Information Systems.