Efficient Interval-focused Similarity Search under Dynamic Time Warping

Similarity search on time series derived from large temporal text corpora is of interest in many settings. Our use case is the Google Books Ngram corpus and historians who study how word frequencies change over time. More specifically, users want to run similarity search restricted to a specific period of time, also known as interval-focused similarity search. Related work formalizes interval-focused similarity search, but the few existing approaches are limited to metric distance measures, such as the Euclidean distance. Most other approaches in this area that support warping distance measures focus on whole-matching similarity search. In this work, we present a novel search tree that uses so-called time series envelopes to group objects. To speed up tree traversal, our search tree approximates the envelopes based on the node height, i.e., envelopes are tighter further down in the tree. We combine this with various time series pruning techniques, mainly to reduce the number of expensive distance computations. Our experimental evaluation shows that this combination is worthwhile and indeed decisive for a significant speedup compared to less sophisticated adaptations of known approaches. First, we show that combining the pruning of groups of time series with the pruning of single time series outperforms either pruning technique used alone. Second, we compare the wall-clock run times of our data structure to those of existing approaches and find a significant speedup for interval-focused similarity search queries on large temporal data sets, such as the Google Books Ngram corpus.
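To make the ingredients named in the abstract concrete, the following is a minimal sketch of interval-focused nearest-neighbor search under Dynamic Time Warping with envelope-based pruning. It is not the search tree proposed in this work: it is a plain linear scan that restricts every series to the query interval and skips the expensive DTW computation whenever an LB_Keogh-style envelope lower bound already exceeds the best distance found so far. All function names, the Sakoe-Chiba band parameter `window`, and the toy data are assumptions made for illustration.

```python
import math


def dtw(a, b, window):
    """DTW distance with squared point costs and a Sakoe-Chiba band."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m))  # the band must at least cover the length gap
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return math.sqrt(prev[m])


def envelope(series, window):
    """Pointwise lower/upper envelope of one series for the given band."""
    n = len(series)
    lower = [min(series[max(0, i - window): i + window + 1]) for i in range(n)]
    upper = [max(series[max(0, i - window): i + window + 1]) for i in range(n)]
    return lower, upper


def group_envelope(envelopes):
    """Envelope enclosing several envelopes (looser, but valid for all members)."""
    lowers, uppers = zip(*envelopes)
    return [min(v) for v in zip(*lowers)], [max(v) for v in zip(*uppers)]


def lb_keogh(query, lower, upper):
    """Lower bound on DTW: distance from the query to the envelope."""
    total = 0.0
    for q, lo, up in zip(query, lower, upper):
        if q > up:
            total += (q - up) ** 2
        elif q < lo:
            total += (lo - q) ** 2
    return math.sqrt(total)


def interval_nn(query, database, start, end, window):
    """Nearest neighbor of query[start:end] among series[start:end]."""
    q = query[start:end]
    best_id, best_dist = None, math.inf
    for series_id, series in database.items():
        c = series[start:end]
        lower, upper = envelope(c, window)
        if lb_keogh(q, lower, upper) >= best_dist:
            continue  # pruned: the exact distance cannot beat the current best
        d = dtw(q, c, window)
        if d < best_dist:
            best_id, best_dist = series_id, d
    return best_id, best_dist


if __name__ == "__main__":
    # Toy data (hypothetical): yearly relative frequencies of three words.
    database = {
        "a": [9, 9, 2, 3, 4, 5, 0, 0],
        "b": [0, 1, 5, 5, 5, 5, 6, 7],
        "c": [0, 1, 2, 4, 3, 5, 9, 9],
    }
    query = [0, 1, 2, 3, 4, 5, 6, 7]
    print(interval_nn(query, database, start=2, end=6, window=1))
```

In a tree-based index, a function like `group_envelope` would let an inner node summarize all series below it, so that a single lower-bound test can discard a whole group; the sketch only shows that such a merged envelope remains a valid, if looser, bound for each member.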
