Efficient Similarity Join for Time Sequences Using Locality Sensitive Hash and MapReduce

In this paper, we study how to efficiently perform similarity joins over massive collections of time sequences in parallel, using Locality Sensitive Hash (LSH) and MapReduce. We propose a 4-stage approach that takes a set of time sequences as input and outputs the pairs of time sequences that satisfy a similarity join condition. First, we map each time sequence into the frequency domain using the Discrete Fourier Transform to avoid the curse of dimensionality. Second, we find candidate pairs of similar time sequences using Locality Sensitive Hash, so that the pairwise similarity computation is restricted to these candidates and remains efficient. Third, we propose solutions for removing duplicate pairs, which avoids recomputing the similarity of pairs that are selected as candidates more than once. Finally, to improve the performance of similarity join over massive time sequences, we use the popular MapReduce framework in each step. Experimental results show that our method is efficient and scalable.
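To make the pipeline described above concrete, the following is a minimal single-process sketch in Python, not the implementation evaluated in the paper. Several details are assumptions introduced only for illustration: the function names (`dft_features`, `make_hashes`, `map_phase`, `similarity_join`), the choice of a random-projection LSH family for Euclidean distance, the parameters `k`, `num_tables`, and `width`, and the use of Euclidean distance with threshold `eps` as the join condition. The map, shuffle, and reduce phases are simulated with plain Python data structures rather than run on a MapReduce cluster.

```python
import cmath
import math
import random
from collections import defaultdict
from itertools import combinations


def dft_features(seq, k):
    """Return the first k DFT coefficients of seq as 2*k real numbers.

    The 1/sqrt(n) factor makes the transform unitary, so distances between
    truncated feature vectors never exceed the time-domain distances.
    """
    n = len(seq)
    feats = []
    for f in range(k):
        c = sum(x * cmath.exp(-2j * cmath.pi * f * t / n) for t, x in enumerate(seq))
        c /= math.sqrt(n)
        feats.extend((c.real, c.imag))
    return feats


def make_hashes(dim, num_tables, width, seed=0):
    """Draw one random projection (a, b) per hash table (assumed LSH family)."""
    rng = random.Random(seed)
    return [([rng.gauss(0, 1) for _ in range(dim)], rng.uniform(0, width))
            for _ in range(num_tables)]


def map_phase(seq_id, feats, hashes, width):
    """Map step: emit ((table, bucket), seq_id) for every hash table."""
    for table, (a, b) in enumerate(hashes):
        projection = sum(ai * fi for ai, fi in zip(a, feats))
        yield (table, math.floor((projection + b) / width)), seq_id


def similarity_join(sequences, eps, k=3, num_tables=4, width=4.0):
    """Return sorted id pairs whose Euclidean distance is at most eps."""
    # Stage 1: map every sequence into the frequency domain.
    feats = {sid: dft_features(seq, k) for sid, seq in sequences.items()}
    hashes = make_hashes(2 * k, num_tables, width)

    # Stage 2: hash into buckets; the dict stands in for the shuffle phase.
    buckets = defaultdict(list)
    for sid, f in feats.items():
        for key, value in map_phase(sid, f, hashes, width):
            buckets[key].append(value)

    # Stage 3: sequences sharing a bucket become candidates; the set removes
    # pairs that collide in more than one hash table.
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))

    # Final verification on the original sequences.
    result = []
    for a, b in sorted(candidates):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(sequences[a], sequences[b])))
        if dist <= eps:
            result.append((a, b))
    return result
```

A call such as `similarity_join({"a": [1, 2, 3, 4], "b": [1, 2, 3, 4], "c": [9, -9, 9, -9]}, eps=0.5)` compares only sequences that share at least one bucket. Because LSH is probabilistic, pairs that are close but not identical may occasionally be missed; in this sketch, increasing `num_tables` or `width` trades extra candidate verification for higher recall.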