Paper Information - Efficient Similarity Join for Time Sequences Using Locality Sensitive Hash and Mapreduce

Efficient Similarity Join for Time Sequences Using Locality Sensitive Hash and Mapreduce

In this paper, we study how to efficiently perform similarity joins over massive time sequences in parallel using Locality Sensitive Hash and MapReduce. We propose a four-stage approach that takes a set of time sequences as input and outputs the pairs of time sequences that satisfy a similarity join condition. First, we map each time sequence into the frequency domain using the Discrete Fourier Transform to avoid the curse of dimensionality. Second, we find candidate pairs of similar time sequences using Locality Sensitive Hash, which keeps the pairwise similarity computation efficient. Third, we propose solutions for removing duplicate pairs, so that the similarity of a pair selected as a candidate more than once is not computed repeatedly. Finally, to improve the performance of similarity join over massive time sequences, we use the popular MapReduce framework in each step. The experimental results show that our method is efficient and scalable.
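The stages named in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation: the abstract does not specify the hash family, the similarity measure, or the MapReduce job layout, so the sketch assumes Euclidean distance, a p-stable (Gaussian projection) hash family, equal-length sequences, and plain Python dictionaries standing in for the map, shuffle, and reduce phases. All function names and parameters (`dft_features`, `lsh_keys`, `similarity_join`, `k`, `tables`, `bits`, `width`) are hypothetical.

```python
import cmath
import math
import random
from collections import defaultdict
from itertools import combinations


def dft_features(seq, k):
    """Stage 1: keep the first k DFT coefficients as 2k real features."""
    n = len(seq)
    feats = []
    for f in range(k):
        coeff = sum(x * cmath.exp(-2j * math.pi * f * t / n)
                    for t, x in enumerate(seq)) / math.sqrt(n)
        feats.extend((coeff.real, coeff.imag))
    return feats


def lsh_keys(vec, tables):
    """Stage 2: one bucket key per hash table (assumed p-stable scheme)."""
    keys = []
    for table_id, (planes, offsets, width) in enumerate(tables):
        cells = tuple(
            math.floor((sum(p * v for p, v in zip(plane, vec)) + b) / width)
            for plane, b in zip(planes, offsets))
        keys.append((table_id, cells))
    return keys


def similarity_join(sequences, eps, k=3, tables=4, bits=4, width=4.0, seed=0):
    """Return sorted id pairs whose Euclidean distance is at most eps."""
    rng = random.Random(seed)
    dim = 2 * k
    hash_tables = [
        ([[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)],
         [rng.uniform(0.0, width) for _ in range(bits)],
         width)
        for _ in range(tables)]

    # "Map" and "shuffle": emit (bucket key, id) and group ids by key.
    buckets = defaultdict(list)
    for sid, seq in sequences.items():
        for key in lsh_keys(dft_features(seq, k), hash_tables):
            buckets[key].append(sid)

    # Stage 3: a pair that collides in several tables is stored only once.
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))

    # "Reduce": verify each candidate once against the original sequences.
    return [(a, b) for a, b in sorted(candidates)
            if math.dist(sequences[a], sequences[b]) <= eps]
```

Because LSH is probabilistic, this sketch can miss true pairs whose features fall into different buckets in every table; raising `tables` or `width` trades extra candidate verification for higher recall. Holding the candidates in a set is only one way to remove duplicates and is not necessarily the solution the paper proposes.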

Dehua Chen | Shoujian Yu | Meng Zhou | Liangliang Zheng

[1] Eamonn J. Keogh, et al. Data Mining a Trillion Time Series Subsequences Under Dynamic Time Warping, 2013, IJCAI.

[2] Ming-Ling Lo, et al. Spatial hash-joins, 1996, SIGMOD '96.

[3] Hans-Peter Kriegel, et al. Efficient processing of spatial joins using R-trees, 1993, SIGMOD Conference.

[4] Christos Faloutsos, et al. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors, 2012, Proc. VLDB Endow.

[5] Jure Leskovec, et al. Mining of Massive Datasets, 2014.

[6] Elke A. Rundensteiner, et al. Spatial Joins Using R-trees: Breadth-First Traversal with Global Optimizations, 1997, VLDB.

[7] Alberto O. Mendelzon, et al. Similarity-based queries for time series data, 1997, SIGMOD '97.

[8] Christos Faloutsos, et al. Efficient Similarity Search In Sequence Databases, 1993, FODO.

[9] Eamonn J. Keogh, et al. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases, 2001, Knowledge and Information Systems.

[10] Chen Li, et al. Efficient parallel set-similarity joins using MapReduce, 2010, SIGMOD Conference.

[11] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.