LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew

All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work has highlighted the importance of discovering pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests overlap on only a small percentage of items. In such systems, it is also common for some dimensions to be highly skewed because they are very popular. Together, these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. In this respect, our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs.