Scalable Metric Similarity Join Using MapReduce

Given two collections of objects, a metric similarity join finds all pairs of similar objects according to a given distance function in a metric space. In the era of Big Data, there is increasing demand for a scalable similarity join algorithm that can support efficient query and analytical services. In this paper, we propose SMS-Join, a parallel framework for similarity joins in metric space based on the MapReduce paradigm. SMS-Join works in three steps: it first selects some records as pivots in a preprocessing phase, then splits the data into partitions around those pivots with a map job, and finally computes the join results with a reduce job. To balance the load across partitions, we devise a lightweight sampling technique that obtains high-quality samples while maintaining high performance. To reduce the partition cost, we develop an iterative partition strategy for the map phase. We implement our framework on the Apache Spark platform and conduct extensive experiments on four real-world datasets. The results show that our method significantly outperforms state-of-the-art methods.
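The pivot, map, and reduce workflow described above can be illustrated with a small single-machine sketch. It is not the SMS-Join implementation: the function names (`select_pivots`, `map_phase`, `reduce_phase`, `similarity_join`), the uniform random pivot selection, the Euclidean distance, and the replication rule are all assumptions made for illustration, and the paper's sampling technique and iterative partition strategy are not reproduced. Each record of R goes to its nearest pivot, and each record of S is copied to every partition that could contain a match, using the triangle inequality as the filter.

```python
import math
import random
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance; any metric obeying the triangle inequality works."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_pivots(records, k, seed=0):
    """Preprocessing (assumed): pick k records uniformly at random as pivots."""
    return random.Random(seed).sample(records, k)


def map_phase(R, S, pivots, eps, dist):
    """Assign each r to its nearest pivot; replicate each s where needed."""
    parts = defaultdict(lambda: {"R": [], "S": []})
    radius = defaultdict(float)  # largest pivot-to-r distance per partition
    for r in R:
        d = [dist(r, p) for p in pivots]
        i = min(range(len(pivots)), key=d.__getitem__)
        parts[i]["R"].append(r)
        radius[i] = max(radius[i], d[i])
    active = list(parts)  # only partitions that received some r
    for s in S:
        for i in active:
            # If d(r, s) <= eps for some r in partition i, then
            # d(s, p_i) <= d(s, r) + d(r, p_i) <= eps + radius_i.
            if dist(s, pivots[i]) <= radius[i] + eps:
                parts[i]["S"].append(s)
    return parts


def reduce_phase(parts, eps, dist):
    """Verify candidate pairs inside each partition independently."""
    result = []
    for part in parts.values():
        for r in part["R"]:
            for s in part["S"]:
                if dist(r, s) <= eps:
                    result.append((r, s))
    return result


def similarity_join(R, S, eps, k=4, dist=euclidean, seed=0):
    """Return all pairs (r, s) with dist(r, s) <= eps."""
    pivots = select_pivots(R, min(k, len(R)), seed)
    return reduce_phase(map_phase(R, S, pivots, eps, dist), eps, dist)


if __name__ == "__main__":
    rng = random.Random(1)
    R = [(rng.random(), rng.random()) for _ in range(200)]
    S = [(rng.random(), rng.random()) for _ in range(200)]
    print(len(similarity_join(R, S, eps=0.05)), "similar pairs found")
```

Because every record of R lands in exactly one partition, each qualifying pair is reported once, and the partitions can be verified independently, which is what makes the reduce step parallelizable. Random pivots give no balance guarantee, which is the gap the sampling technique in the abstract is meant to address.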
