Efficient Similarity Join over Multiple Stream Time Series

Similarity join (SJ) in time-series databases has a wide spectrum of applications such as data cleaning and mining. Specifically, an SJ query retrieves all pairs of (sub)sequences from two time-series databases that epsiv-match with each other, where epsiv is the matching threshold. Previous work on this problem usually considers static time-series databases, where queries are performed either on disk-based multidimensional indexes built on static data or by nested loop join (NLJ) without indexes. SJ over multiple stream time series, which continuously outputs pairs of similar subsequences from stream time series, strongly requires low memory consumption, low processing cost, and query procedures that are themselves adaptive to time-varying stream data. These requirements invalidate the existing approaches in static databases. In this paper, we propose an efficient and effective approach to perform SJ among multiple stream time series incrementally. In particular, we present a novel method, Adaptive Radius-based Search (ARES), which can answer the similarity search without false dismissals and is seamlessly integrated into SJ processing. Most importantly, we provide a formal cost model for ARES, based on which ARES can be adaptive to data characteristics, achieving the minimum number of refined candidate pairs, and thus, suitable for stream processing. Furthermore, in light of the cost model, we utilize space-efficient synopses that are constructed for stream time series to further reduce the candidate set. Extensive experiments demonstrate the efficiency and effectiveness of our proposed approach.

[1]  Dennis Shasha,et al.  StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time , 2002, VLDB.

[2]  Gerhard Weikum,et al.  KLEE: A Framework for Distributed Top-k Query Algorithms , 2005, VLDB.

[3]  Dimitrios Gunopulos,et al.  Online outlier detection in sensor data using non-parametric models , 2006, VLDB.

[4]  Xiaoyan Liu,et al.  Efficient k-NN Search on Streaming Data Series , 2003, SSTD.

[5]  Yufei Tao,et al.  RPJ: producing fast join results on streams through rate-based optimization , 2005, SIGMOD '05.

[6]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[7]  Hans-Peter Kriegel,et al.  Efficient processing of spatial joins using R-trees , 1993, SIGMOD Conference.

[8]  Raymond T. Ng,et al.  Indexing spatio-temporal trajectories with Chebyshev polynomials , 2004, SIGMOD '04.

[9]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[10]  Yang-Sae Moon,et al.  Duality-based subsequence matching in time-series databases , 2001, Proceedings 17th International Conference on Data Engineering.

[11]  Ramesh C. Jain,et al.  Similarity indexing with the SS-tree , 1996, Proceedings of the Twelfth International Conference on Data Engineering.

[12]  Yang-Sae Moon,et al.  General match: a subsequence matching method in time-series databases based on generalized windows , 2002, SIGMOD '02.

[13]  Lei Chen,et al.  On The Marriage of Lp-norms and Edit Distance , 2004, VLDB.

[14]  Lei Chen,et al.  On the Marriage of Edit Distance and Lp Norms , 2004, VLDB 2004.

[15]  Theodore Johnson,et al.  Gigascope: a stream database for network applications , 2003, SIGMOD '03.

[16]  Elke A. Rundensteiner,et al.  Spatial Joins Using R-trees: Breadth-First Traversal with Global Optimizations , 1997, VLDB.

[17]  Christian Böhm,et al.  A cost model for nearest neighbor search in high-dimensional data space , 1997, PODS.

[18]  Walid G. Aref,et al.  Hash-merge join: a non-blocking join algorithm for producing fast and early join results , 2004, Proceedings. 20th International Conference on Data Engineering.

[19]  Ming-Ling Lo,et al.  Spatial hash-joins , 1996, SIGMOD '96.

[20]  Dennis Shasha,et al.  Efficient elastic burst detection in data streams , 2003, KDD '03.

[21]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[22]  Changzhou Wang,et al.  Supporting content-based searches on time series via approximation , 2000, Proceedings. 12th International Conference on Scientific and Statistica Database Management.

[23]  Ge Yu,et al.  Similarity Match Over High Speed Time-Series Streams , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[24]  Ada Wai-Chee Fu,et al.  Efficient time series matching by wavelets , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[25]  Clu-istos Foutsos,et al.  Fast subsequence matching in time-series databases , 1994, SIGMOD '94.

[26]  Christos Faloutsos,et al.  Efficiently supporting ad hoc queries in large datasets of time sequences , 1997, SIGMOD '97.

[27]  C. C. Heyde,et al.  Central Limit Theorem , 2006 .

[28]  K Lehnertz,et al.  Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: dependence on recording region and brain state. , 2001, Physical review. E, Statistical, nonlinear, and soft matter physics.

[29]  Ambuj K. Singh,et al.  A unified framework for monitoring data streams in real time , 2005, 21st International Conference on Data Engineering (ICDE'05).

[30]  Christos Faloutsos,et al.  Stream Monitoring under the Time Warping Distance , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[31]  Yunhao Liu,et al.  Indexable PLA for Efficient Similarity Search , 2007, VLDB.

[32]  Dimitrios Gunopulos,et al.  Discovering similar multidimensional trajectories , 2002, Proceedings 18th International Conference on Data Engineering.

[33]  Apostolos N. Papadopoulos,et al.  Efficient similarity search in streaming time sequences , 2004, Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004..

[34]  Christos Faloutsos,et al.  Fast Time Sequence Indexing for Arbitrary Lp Norms , 2000, VLDB.

[35]  Donald J. Berndt,et al.  Finding Patterns in Time Series: A Dynamic Programming Approach , 1996, Advances in Knowledge Discovery and Data Mining.