Efficient Top-k Similarity Join of Massive Time Series Using MapReduce

The top-k similarity join of time series, which finds the k most similar pairs of time series records, is a primitive operation used by many time series data analysis applications. Computing such a join is challenging today because many modern applications generate massive amounts of time series data, and a single centralized machine cannot perform a top-k similarity join over a large time series database efficiently. In this paper, we investigate how to perform the top-k similarity join of massive time series in parallel using MapReduce over a large cluster of commodity machines. Our proposed MapReduce-based algorithm consists of four steps; it takes as input a set of time series records and outputs an ordered list of the k closest pairs. To improve the efficiency of computing the top-k similarity join, we propose several solutions. We first introduce an efficient distance function for time series based on LSH (Locality-Sensitive Hashing) to speed up pairwise similarity comparison. We next propose all-pair partitioning methods to minimize the amount of data transferred between the map and reduce functions. Moreover, we use a serial computation strategy when parallelizing the computation of the local top-k closest pairs in each partition. Our performance study confirms the effectiveness and scalability of our MapReduce algorithms.
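As a rough illustration of the partition-then-merge idea, the following single-process Python sketch splits the records into blocks, computes a local top-k list for every pair of blocks, and merges the local lists into a global top-k. It is not the four-step algorithm of this paper: plain Euclidean distance stands in for the LSH-based distance function, the round-robin split stands in for the all-pair partitioning methods, and the function names (`split_blocks`, `local_top_k`, `top_k_join`) and the `num_blocks` parameter are hypothetical.

```python
import heapq
import math
from itertools import combinations, combinations_with_replacement


def euclidean(a, b):
    """Euclidean distance between two equal-length series (a stand-in metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_blocks(records, num_blocks):
    """Round-robin split of (id, series) records into num_blocks blocks."""
    blocks = [[] for _ in range(num_blocks)]
    for idx, rec in enumerate(records):
        blocks[idx % num_blocks].append(rec)
    return blocks


def local_top_k(block_a, block_b, k, same_block):
    """Top-k closest pairs inside one partition (the work of one reducer)."""
    if same_block:
        # Pairs within a single block: each unordered pair exactly once.
        candidates = combinations(block_a, 2)
    else:
        # Pairs across two different blocks.
        candidates = ((r, s) for r in block_a for s in block_b)
    scored = (
        (euclidean(r[1], s[1]), *sorted((r[0], s[0]))) for r, s in candidates
    )
    return heapq.nsmallest(k, scored)


def top_k_join(records, k, num_blocks=2):
    """Merge the local top-k lists of all block pairs into a global top-k."""
    blocks = split_blocks(records, num_blocks)
    partial = []
    # Every unordered pair of records falls into exactly one (i, j) partition.
    for i, j in combinations_with_replacement(range(num_blocks), 2):
        partial.extend(local_top_k(blocks[i], blocks[j], k, i == j))
    return heapq.nsmallest(k, partial)


if __name__ == "__main__":
    records = [
        ("a", [0, 0]), ("b", [0, 1]), ("c", [5, 5]),
        ("d", [5, 7]), ("e", [10, 10]),
    ]
    for dist, left, right in top_k_join(records, k=2):
        print(left, right, dist)
```

The merge step is safe because any pair that belongs to the global top-k must also be among the k closest pairs of the partition that contains it, so keeping only k candidates per partition loses nothing. In a real MapReduce job each call to `local_top_k` would run in a separate reduce task.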