Paper Information - LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew

All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work has highlighted the importance of discovering pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests only overlap on a small percentage of items. In such systems, it is also common that some dimensions are highly skewed, because they are very popular. Together, these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. In particular, our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs.
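To make the task in the abstract concrete, the following is a minimal single-machine sketch, not the LSF-Join procedure itself. It assumes Jaccard similarity as the measure (the abstract does not name one) and contrasts an exact all-pairs join with a toy filter-based candidate step. The names `brute_force_pairs` and `filtered_pairs` and the parameters `num_filters`, `filter_size`, and `seed` are hypothetical, and the sketch does not model distribution across machines or the handling of skewed dimensions.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def brute_force_pairs(sets, threshold):
    """Exact all-pairs join: compares every pair, quadratic in len(sets)."""
    return {
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    }


def filtered_pairs(sets, threshold, num_filters=200, filter_size=1, seed=0):
    """Toy filter-based join (illustrative assumption, not the paper's scheme).

    Each filter is a small random subset of the universe. A set passes a
    filter if it contains every element of that filter. Only sets passing
    the same filter are compared, so some close pairs may be missed.
    """
    rng = random.Random(seed)
    universe = sorted(set().union(*sets))
    if len(universe) < filter_size:
        return set()
    found = set()
    for _ in range(num_filters):
        chosen = set(rng.sample(universe, filter_size))
        # Indices of the sets that pass this filter, in increasing order.
        bucket = [i for i, s in enumerate(sets) if chosen <= s]
        for i, j in combinations(bucket, 2):
            # Verify each candidate pair exactly before reporting it.
            if jaccard(sets[i], sets[j]) >= threshold:
                found.add((i, j))
    return found


# Users whose interests overlap on only a small fraction of items.
users = [
    {"a", "b", "c", "d"},
    {"c", "d", "e", "f"},
    {"x", "y", "z"},
    {"a", "x"},
]
exact = brute_force_pairs(users, 0.25)
approx = filtered_pairs(users, 0.25)
```

Because every candidate is verified against the threshold, `approx` is always a subset of `exact`. How many close pairs it recovers depends on the number and size of the filters, which is the kind of trade-off the paper analyzes formally.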

Cyrus Rashtchian | David P. Woodruff | Aneesh Sharma
