Fast Similarity Searches of Time-Warped SubSequences in Sequence Databases

Several indexing techniques have been proposed to process similarity queries in sequence databases. Most of them focus on finding similar sequences of the same length using the Euclidean distance metric. However, in applications where the elements of sequences may be sampled at different rates, the time warping distance is a more suitable similarity measure. In this paper, we propose an indexing technique based on a suffix tree for fast retrieval of similar sub-sequences under time warping. The suffix tree search algorithm is extended to support similarity searches, and categorization is applied to reduce index size and to accelerate query processing. A further reduction in index size is achieved by using a sparse suffix tree, and a further speed-up is obtained by quickly estimating the time warping distances between non-stored suffixes and a query sequence. Our method guarantees no false dismissals, since the actual time warping distances are always lower-bounded in the index space. Our access method can also be used to answer shape-based queries, since approximate shapes of sub-sequences are maintained in the index space. Experiments on stock and artificial sequences show that our approach is about 4 times faster than sequential scanning with a relatively small index space, and that the performance gain increases to as much as 20 times as the index grows.
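As a rough illustration of the two ideas the abstract relies on, the following minimal sketch computes a time warping distance by dynamic programming and a lower bound on it after categorization. It is not the paper's implementation: the absolute difference as the element-wise cost, the equal-width categories, and the names `time_warping_distance`, `categorize`, `interval_distance`, and `lower_bound_distance` are all assumptions made for this example, and no suffix tree is built.

```python
import math


def time_warping_distance(s, q):
    """Time warping distance between two non-empty numeric sequences.

    Assumption: the element-wise cost is the absolute difference, and
    the total is the sum of costs along the cheapest warping path.
    """
    n, m = len(s), len(q)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - q[j - 1])
            # An element may be stretched (repeated) on either side.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def categorize(value, width):
    """Map a value to the interval [lo, lo + width] that contains it.

    Assumption: equal-width categories; other schemes are possible.
    """
    lo = math.floor(value / width) * width
    return (lo, lo + width)


def interval_distance(interval, x):
    """Smallest possible |v - x| over all values v inside the interval."""
    lo, hi = interval
    if x < lo:
        return lo - x
    if x > hi:
        return x - hi
    return 0.0


def lower_bound_distance(categories, q):
    """Same recurrence as above, but on categorized elements.

    Every element-wise cost is at most the true cost, so the result
    never exceeds the exact time warping distance.
    """
    n, m = len(categories), len(q)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = interval_distance(categories[i - 1], q[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


if __name__ == "__main__":
    s = [20, 21, 21, 20, 20, 23, 23, 23]
    q = [20, 20, 21, 20, 23]
    cats = [categorize(v, 2.0) for v in s]
    print(time_warping_distance(s, q), lower_bound_distance(cats, q))
```

In this sketch the two sequences in the usage example have different lengths but the same shape, so their time warping distance is 0 even though a Euclidean comparison is not defined for them. Because the categorized distance never exceeds the exact one, discarding a candidate only when its lower bound exceeds the query tolerance cannot cause a false dismissal; candidates that survive still need the exact distance computed.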
