Enhancing the Symbolic Aggregate Approximation Method Using Updated Lookup Tables

Similarity search in time series data mining has attracted increasing attention in recent years. Because time series databases are both high-dimensional and large, sequential scanning is an inefficient way to tackle this problem. Many representation techniques aim to reduce the dimensionality of time series so that the search can be carried out faster in a lower-dimensional space. Symbolic representation is one of the most promising of these techniques, because symbolic methods can draw on the wealth of search algorithms developed in the bioinformatics and text mining communities. The symbolic aggregate approximation (SAX) is one of the most competitive methods in the literature. SAX uses a similarity measure that is easy to compute because it is based on pre-computed distances obtained from lookup tables. In this paper we present a new similarity measure that is almost as easy to compute as the original one, but tighter, because it uses updated lookup tables. The new similarity measure is also more intuitive than the original one. We conduct several experiments which show that the new similarity measure gives better results than the original one.
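
For readers unfamiliar with the lookup-table distance mentioned above, the following is a minimal sketch of the original SAX measure (MINDIST) as it is commonly described in the literature, not the updated measure proposed in this paper. The function names, the use of the population standard deviation for normalization, and the requirement that the series length be divisible by the word length are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, pstdev


def breakpoints(alphabet_size):
    """Quantiles that split N(0, 1) into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def znormalize(series):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Piecewise aggregate approximation; assumes len(series) % w == 0."""
    seg = len(series) // w
    return [mean(series[i * seg:(i + 1) * seg]) for i in range(w)]


def to_sax(series, w, alphabet_size):
    """Map a series to a word of w symbols, encoded as integers from 0."""
    beta = breakpoints(alphabet_size)
    # A symbol is the number of breakpoints at or below the segment mean.
    return [sum(1 for b in beta if v >= b) for v in paa(znormalize(series), w)]


def lookup_table(alphabet_size):
    """Pre-computed symbol distances: 0 for equal or adjacent symbols,
    otherwise the gap between the two regions' nearest breakpoints."""
    beta = breakpoints(alphabet_size)
    size = alphabet_size
    return [
        [0.0 if abs(r - c) <= 1 else beta[max(r, c) - 1] - beta[min(r, c)]
         for c in range(size)]
        for r in range(size)
    ]


def mindist(word_a, word_b, n, table):
    """Distance between two SAX words built from series of length n."""
    w = len(word_a)
    total = sum(table[p][q] ** 2 for p, q in zip(word_a, word_b))
    return math.sqrt(n / w) * math.sqrt(total)


# Usage: two short series, word length 4, alphabet size 4.
s1 = [1, 2, 3, 4, 5, 6, 7, 8]
s2 = [8, 7, 6, 5, 4, 3, 2, 1]
table = lookup_table(4)
d = mindist(to_sax(s1, 4, 4), to_sax(s2, 4, 4), len(s1), table)
print(d)
```

Because equal and adjacent symbols are assigned a distance of zero, this measure can underestimate the true distance considerably; a tighter measure of the kind the abstract describes would replace the entries of this table.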
