The Communication Complexity of Distributed Set-Joins with Applications to Matrix Multiplication

Given a set-comparison predicate P and two lists of sets A = (A_1, ..., A_m) and B = (B_1, ..., B_m), with all A_i, B_j ⊆ [n], the P-set join A ⋈_P B is defined to be the set {(i, j) ∈ [m] × [m] | P(A_i, B_j)}. When P(A_i, B_j) is the condition "A_i ∩ B_j ≠ ∅" we call this the set-intersection-not-empty join (a.k.a. the composition of A and B); when P(A_i, B_j) is "A_i ∩ B_j = ∅" we call it the set-disjointness join; when P(A_i, B_j) is "A_i = B_j" we call it the set-equality join; when P(A_i, B_j) is "|A_i ∩ B_j| ≥ T" for a given threshold T, we call it the set-intersection-threshold join. Assuming A and B are stored at two different sites in a distributed environment, we study the (randomized) communication complexity of computing these, and related, set-joins A ⋈_P B, as well as the (randomized) communication complexity of computing the exact and approximate value of their size k = |A ⋈_P B|. Taken together, our analyses yield new insights into the quantitative differences among these set-joins. Furthermore, given the close affinity of the natural join and the set-intersection-not-empty join, our results also yield communication complexity results for computing the natural join in a distributed environment. Additionally, we obtain new algorithms for computing the distributed set-intersection-not-empty join when the input and/or output is sparse. For instance, when the output is k-sparse, we improve on an Õ(kn) communication algorithm of Williams and Yu (SODA 2014). Observing that the set-intersection-not-empty join is isomorphic to Boolean matrix multiplication (BMM), our results imply new algorithms for fundamental graph-theoretic problems related to BMM. For example, we show how to compute the transitive closure of a directed graph in Õ(k^{3/2}) time, when the transitive closure contains at most k edges. When k = O(n), we obtain a (practical) Õ(n^{3/2}) time algorithm, improving a recent Õ(n^{1+(ω+3)/4}) time algorithm (Borassi, Crescenzi, and Habib, arXiv 2014) based on (impractical) fast matrix multiplication, where ω ≥ 2 is the exponent for matrix multiplication.
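
To make the definitions above concrete, the following is a minimal single-machine sketch of the four set-joins and of the correspondence between the set-intersection-not-empty join and Boolean matrix multiplication. It is a brute-force illustration only, not the communication protocols or algorithms of the paper. All function names (set_join, intersects, disjoint, equal, at_least, bmm_support) are hypothetical, and indices are 0-based here rather than ranging over [m] and [n].

```python
from typing import Callable, FrozenSet, List, Set, Tuple

SetList = List[FrozenSet[int]]
Pair = Tuple[int, int]
Predicate = Callable[[FrozenSet[int], FrozenSet[int]], bool]


def set_join(A: SetList, B: SetList, P: Predicate) -> Set[Pair]:
    """Brute-force P-set join: all index pairs (i, j) with P(A[i], B[j])."""
    return {(i, j) for i, a in enumerate(A) for j, b in enumerate(B) if P(a, b)}


def intersects(a: FrozenSet[int], b: FrozenSet[int]) -> bool:
    """Predicate of the set-intersection-not-empty join."""
    return not a.isdisjoint(b)


def disjoint(a: FrozenSet[int], b: FrozenSet[int]) -> bool:
    """Predicate of the set-disjointness join."""
    return a.isdisjoint(b)


def equal(a: FrozenSet[int], b: FrozenSet[int]) -> bool:
    """Predicate of the set-equality join."""
    return a == b


def at_least(T: int) -> Predicate:
    """Predicate of the set-intersection-threshold join with threshold T."""
    return lambda a, b: len(a & b) >= T


def bmm_support(A: SetList, B: SetList, n: int) -> Set[Pair]:
    """Nonzero entries of the Boolean product X * Y, where row i of X is the
    indicator vector of A[i] and column j of Y is the indicator vector of B[j]."""
    X = [[x in s for x in range(n)] for s in A]   # len(A) x n
    Y = [[x in s for s in B] for x in range(n)]   # n x len(B)
    return {
        (i, j)
        for i in range(len(A))
        for j in range(len(B))
        if any(X[i][k] and Y[k][j] for k in range(n))
    }


# Small example with n = 4 and three sets per list (an assumption for illustration).
A = [frozenset({0, 1}), frozenset({2}), frozenset()]
B = [frozenset({1, 2}), frozenset({3}), frozenset({0, 1})]

print(sorted(set_join(A, B, intersects)))  # [(0, 0), (0, 2), (1, 0)]
print(bmm_support(A, B, 4) == set_join(A, B, intersects))  # True
```

In this sketch the disjointness join is the complement of the intersection join within the index grid, and the Boolean product has a nonzero entry at (i, j) exactly when A[i] and B[j] share an element, which is the isomorphism the abstract refers to.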

[1]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[2]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[3]  Jeffrey F. Naughton,et al.  Set Containment Joins: The Good, The Bad and The Ugly , 2000, VLDB.

[4]  Vaughan R. Pratt,et al.  On the Syllogism: IV; and on the Logic of Relations , 2022 .

[5]  Michel Habib,et al.  Into the Square - On the Complexity of Quadratic-Time Solvable Problems , 2014, ArXiv.

[6]  Sven Helmer,et al.  Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates , 1996, VLDB.

[7]  Ryan Williams,et al.  Finding orthogonal vectors in discrete structures , 2014, SODA.

[8]  Antonio Badia,et al.  Querying with Generalized Quantifiers , 1993, Workshop on Programming with Logic Databases , ILPS.

[9]  Mirek Riedewald,et al.  Processing theta-joins using MapReduce , 2011, SIGMOD '11.

[10]  Jeremy P. Spinrad,et al.  Linear-time modular decomposition and efficient transitive orientation of comparability graphs , 1994, SODA '94.

[11]  Chen Li,et al.  Efficient parallel set-similarity joins using MapReduce , 2010, SIGMOD Conference.

[12]  Jignesh M. Patel,et al.  A comparison of join algorithms for log processing in MapReduce , 2010, SIGMOD Conference.

[13]  Antonio Badia,et al.  Supporting quantified queries in distributed databases , 2014, Int. J. Parallel Emergent Distributed Syst..

[14]  Joshua Brody,et al.  Beyond set disjointness: the communication complexity of finding the intersection , 2014, PODC '14.

[15]  Andrzej Lingas,et al.  A Fast Output-Sensitive Algorithm for Boolean Matrix Multiplication , 2011, Algorithmica.

[16]  Margaret H. Dunham,et al.  Join processing in relational databases , 1992, CSUR.

[17]  Ravi Kumar,et al.  The One-Way Communication Complexity of Hamming Distance , 2008, Theory Comput..

[18]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[19]  Claudio Gutierrez,et al.  Survey of graph database models , 2008, CSUR.

[20]  Maarten Marx,et al.  Navigational XPath: calculus and algebra , 2007, SGMD.

[21]  David P. Woodruff,et al.  An optimal algorithm for the distinct elements problem , 2010, PODS '10.

[22]  Alexander A. Razborov,et al.  On the Distributional Complexity of Disjointness , 1992, Theor. Comput. Sci..

[23]  Ilan Newman,et al.  Private vs. Common Random Bits in Communication Complexity , 1991, Inf. Process. Lett..

[24]  Gábor Tardos,et al.  On the Communication Complexity of Sparse Set Disjointness and Exists-Equal Problems , 2013, 2013 IEEE 54th Annual Symposium on Foundations of Computer Science.

[25]  Query Processing Concepts and Techniques for Set Containment Tests , 2003 .

[26]  Martin Charles Golumbic,et al.  The complexity of comparability graph recognition and coloring , 1977, Computing.

[27]  Rasmus Pagh,et al.  Compressed matrix multiplication , 2011, ITCS '12.

[28]  Andrew Chi-Chih Yao,et al.  Probabilistic computations: Toward a unified measure of complexity , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[29]  E. F. Codd,et al.  A relational model of data for large shared data banks , 1970, CACM.

[30]  Goetz Graefe,et al.  Fast algorithms for universal quantification in large databases , 1995, TODS.

[31]  Nikos Mamoulis,et al.  Efficient processing of joins on set-valued attributes , 2003, SIGMOD '03.

[32]  Jan Van den Bussche,et al.  On the complexity of division and set joins in the relational algebra , 2005, PODS '05.

[33]  Graham Cormode,et al.  Combinatorial Algorithms for Compressed Sensing , 2006 .

[34]  Ping-Yu Hsu,et al.  Improving SQL with generalized quantifiers , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[35]  Hector Garcia-Molina,et al.  Adaptive algorithms for set containment joins , 2003, TODS.

[36]  Guido Moerkotte,et al.  Optimizing Queries with Universal Quantification in Object-Oriented and Object-Relational Databases , 1997, VLDB.

[37]  Donald Kossmann,et al.  The state of the art in distributed query processing , 2000, CSUR.

[38]  E. Kushilevitz,et al.  Communication Complexity: Basics , 1996 .

[39]  C. R. Subramanian,et al.  Almost Optimal (on the average) Combinatorial Algorithms for Boolean Matrix Product Witnesses, Computing the Diameter (Extended Abstract) , 1998, RANDOM.

[40]  Antonio Badia,et al.  Providing better support for a class of decision support queries , 1996, SIGMOD '96.

[41]  Jeffrey Xu Yu,et al.  Efficient similarity joins for near-duplicate detection , 2011, TODS.

[42]  Rajeev Motwani,et al.  Randomized Algorithms , 1995, SIGA.

[43]  Guoliang Li,et al.  PASS-JOIN: A Partition-based Method for Similarity Joins , 2011, Proc. VLDB Endow..

[44]  Raghav Kaushik,et al.  Efficient exact set-similarity joins , 2006, VLDB.

[45]  Jan Van den Bussche,et al.  Relative expressive power of navigational querying on graphs , 2011, ICDT '11.

[46]  David P. Woodruff,et al.  An Optimal Lower Bound for Distinct Elements in the Message Passing Model , 2014, SODA.

[47]  Mohammad Dadashzadeh An improved division operator for relational algebra , 1989, Inf. Syst..