Similarity join is an important operation in the MapReduce framework for finding pairs of similar objects such as images, videos, and time series. Since MapReduce does not natively support efficient join processing, reducing duplicate candidates and balancing load among partitions are the major challenges. Recently, many partition-based similarity join algorithms have been proposed to address these problems. However, existing algorithms still have limitations in supporting efficient join processing over large-scale data sets. In this paper, we propose a similarity join algorithm with an efficient filtering technique on MapReduce that overcomes the limitations of traditional partitioning methods in two ways: (1) a filtering matrix reduces the number of output records and thereby the number of duplicates, and (2) a join cost estimated from a partition matrix leads to better load balance among reducers. Moreover, we have conducted experimental evaluations using sequential data to show the speed-up and scale-up of the proposed method.
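As a rough illustration of the partition-based approach summarized above, the following sketch simulates a duplicate-free similarity join in plain Python. It is not the algorithm proposed in this paper: the filtering matrix and partition matrix are not reproduced. It assumes one-dimensional numeric records, an absolute-difference threshold `eps`, and fixed-width cells of size `eps`; the names `map_phase`, `shuffle`, `reduce_phase`, and `similarity_join` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records, eps):
    """Assign each record to its home cell and replicate it once to the next cell."""
    for rid, value in records:
        cell = int(value // eps)  # fixed-width partitioning (an assumption of this sketch)
        yield cell, ("home", rid, value)
        yield cell + 1, ("guest", rid, value)


def shuffle(pairs):
    """Group mapper output by key, as the MapReduce shuffle step would."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups


def reduce_phase(values, eps):
    """Verify candidate pairs inside one cell.

    Only home-home and home-guest pairs are compared, so a pair whose
    records fall in the same or in adjacent cells is produced exactly once.
    """
    homes = [(rid, v) for role, rid, v in values if role == "home"]
    guests = [(rid, v) for role, rid, v in values if role == "guest"]
    for (a, x), (b, y) in combinations(homes, 2):
        if abs(x - y) <= eps:
            yield tuple(sorted((a, b)))
    for a, x in homes:
        for b, y in guests:
            if abs(x - y) <= eps:
                yield tuple(sorted((a, b)))


def similarity_join(records, eps):
    """Run the simulated map, shuffle, and reduce steps on a list of (id, value) records."""
    groups = shuffle(map_phase(records, eps))
    result = []
    for values in groups.values():
        result.extend(reduce_phase(values, eps))
    return result


records = [("a", 0), ("b", 1), ("c", 3), ("d", 10)]
result = similarity_join(records, eps=2)
```

Because records whose cells differ by two or more are always farther apart than `eps`, replicating each record only to the next cell is sufficient here, and no pair is reported twice. Load balancing is not addressed in this sketch.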