Beyond set disjointness: the communication complexity of finding the intersection

We consider the following fundamental communication problem: data is distributed among servers, and the servers want to compute the intersection of their data sets, e.g., the common records in a relational database. They want to do this with as little communication and as few messages (rounds) as possible. They are willing to use randomization and to fail with a tiny probability. A protocol for computing the intersection can also be used to compute the exact Jaccard similarity, the rarity, the number of distinct elements, and joins between databases. Computing the intersection is at least as hard as the set disjointness problem, which asks whether the intersection is empty. Formally, in the two-server setting, the players hold subsets S, T ⊆ [n]. In many realistic scenarios, the sizes of S and T are significantly smaller than n, so we impose the constraint that |S|, |T| ≤ k. We study the minimum number of bits the parties need to communicate in order to compute the intersection S ∩ T when they are allowed to exchange a given number r of messages. While O(k log(n/k)) bits is achieved trivially and deterministically with a single message, we ask what is possible with more than one message and with randomization. We give a smooth communication/round tradeoff which shows that with O(log* k) rounds, O(k) bits of communication is possible, which improves upon the trivial protocol by a factor of log(n/k). This is in contrast to other basic problems such as computing the union or symmetric difference, for which Ω(k log(n/k)) bits of communication are required for any number of rounds. For two players, known lower bounds for the easier problem of set disjointness imply that our algorithms are optimal up to constant factors in communication and number of rounds. We extend our protocols to m players, obtaining an optimal O(mk) bits of communication with a similarly small number of rounds.
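To make the baseline concrete, the following is a minimal Python sketch of the trivial one-message protocol mentioned above, together with the statistics that follow once the intersection is known. It is not the multi-round protocol of the paper. The function names (`trivial_protocol_bits`, `trivial_intersection`, `derived_statistics`) are hypothetical, and the sketch assumes the receiving player also knows |S| and |T|, which costs only O(log k) extra bits.

```python
from math import comb


def trivial_protocol_bits(n: int, k: int) -> int:
    """Bits needed to name one subset of [n] of size at most k.

    The sender encodes its set as an index into the list of all such
    subsets, so the cost is ceil(log2(number of subsets)), which is
    O(k log(n/k)) for k <= n/2.
    """
    count = sum(comb(n, i) for i in range(k + 1))
    # (count - 1).bit_length() equals ceil(log2(count)) for count >= 1.
    return (count - 1).bit_length()


def trivial_intersection(s: set, t: set) -> set:
    """One message: the first player sends all of S; the second intersects."""
    message = frozenset(s)  # stands in for the encoded message
    return set(message) & t


def derived_statistics(s: set, t: set) -> dict:
    """Quantities computable from S ∩ T plus the sizes |S| and |T| (assumed known)."""
    common = trivial_intersection(s, t)
    distinct = len(s) + len(t) - len(common)  # |S ∪ T| by inclusion-exclusion
    jaccard = len(common) / distinct if distinct else 1.0
    return {"intersection": common, "distinct": distinct, "jaccard": jaccard}


if __name__ == "__main__":
    print(trivial_protocol_bits(16, 2))
    print(derived_statistics({1, 3, 5, 7}, {3, 4, 5, 9}))
```

The sketch only counts the bits of the single-message baseline; the O(k)-bit bound stated in the abstract needs the randomized, multi-round protocol, which is not reproduced here.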
