Massively-Parallel Similarity Join, Edge-Isoperimetry, and Distance Correlations on the Hypercube

We study distributed protocols for finding all pairs of similar vectors in a large dataset. Our results pertain to a variety of discrete metrics, and we give concrete instantiations for Hamming distance. In particular, we give improved upper bounds on the overhead required for similarity defined by Hamming distance r > 1 and prove a lower bound showing qualitative optimality of the overhead required for similarity over any Hamming distance r. Our main conceptual contribution is a connection between similarity search algorithms and certain graph-theoretic quantities. For our upper bounds, we exhibit a general method for designing one-round protocols using edge-isoperimetric shapes in similarity graphs. For our lower bounds, we define a new combinatorial optimization problem, which can be stated in purely graph-theoretic terms yet also captures the core of the analysis in previous theoretical work on distributed similarity joins. As one of our main technical results, we prove new bounds on distance correlations in subsets of the Hamming cube.
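
For concreteness, the underlying problem can be written as a brute-force reference computation. The following is a minimal single-machine sketch, not the distributed protocol studied here. It assumes binary vectors packed into Python integers and takes "similar" to mean Hamming distance at most r. The names `hamming_distance` and `similarity_join` are illustrative and do not come from the paper.

```python
from itertools import combinations


def hamming_distance(x: int, y: int) -> int:
    """Count the bit positions in which x and y differ (vectors packed as ints)."""
    return bin(x ^ y).count("1")


def similarity_join(vectors, r):
    """Return all index pairs (i, j), i < j, whose vectors are within distance r.

    Assumption: 'similar' means Hamming distance at most r.
    This compares every pair, so it takes quadratic time in len(vectors).
    """
    return [
        (i, j)
        for (i, x), (j, y) in combinations(enumerate(vectors), 2)
        if hamming_distance(x, y) <= r
    ]
```

A distributed protocol must produce the same set of pairs while splitting the vectors across machines; the quadratic comparison above only pins down what the correct output is.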
