Tight bounds for distributed functional monitoring

We resolve several fundamental questions in distributed functional monitoring, a model initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008) that has received recent attention. In this model, k sites each track their own input stream and communicate with a central coordinator. The coordinator's task is to continuously maintain an approximation of a function computed over the union of the k streams. The goal is to minimize the number of bits communicated. Let the p-th frequency moment be defined as F<sub>p</sub> = ∑<sub>i</sub> f<sub>i</sub><sup>p</sup>, where f<sub>i</sub> is the frequency of element i. We show that the randomized communication complexity of estimating the number of distinct elements (that is, F<sub>0</sub>) up to a 1+ε factor is Ω(k/ε<sup>2</sup>), improving upon the previous Ω(k + 1/ε<sup>2</sup>) bound and matching known upper bounds. For F<sub>p</sub>, p > 1, we improve the previous Ω(k + 1/ε<sup>2</sup>) communication bound to Ω(k<sup>p-1</sup>/ε<sup>2</sup>). We obtain similar improvements for heavy hitters, empirical entropy, and other problems. Our lower bounds are the first of any kind in distributed functional monitoring to depend on the product of k and 1/ε<sup>2</sup>. Moreover, the lower bounds hold for the static version of the model, in which the coordinator only needs to compute the function once all k input streams have ended; surprisingly, they almost match what is achievable in the dynamic version of the model, in which the coordinator must track the function continuously at every time step. We also show that F<sub>p</sub> can be estimated, for any p > 1, using O(k<sup>p-1</sup> poly(ε<sup>-1</sup>)) communication. This drastically improves upon the previous O(k<sup>2p+1</sup>N<sup>1-2/p</sup> poly(ε<sup>-1</sup>)) bound of Cormode, Muthukrishnan, and Yi for general p, and upon their O(k<sup>2</sup>/ε + k<sup>1.5</sup>/ε<sup>3</sup>) bound for p = 2. For p = 2, our bound resolves their main open question. Our lower bounds are based on new direct sum theorems for approximate majority, and they yield improvements for classical problems in the standard data stream model. First, we improve the known lower bound for estimating F<sub>p</sub>, p > 2, in t passes from Ω(n<sup>1-2/p</sup>/(ε<sup>2/p</sup> t)) to Ω(n<sup>1-2/p</sup>/(ε<sup>4/p</sup> t)), giving the first bound that matches what we expect when p = 2 for any constant number of passes. Second, we give the first Ω(1/(ε<sup>2</sup> t))-bit space lower bound for estimating F<sub>0</sub> in t passes that does not rely on the hardness of the gap-hamming problem.
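To make the definition of F<sub>p</sub> concrete, here is a minimal Python sketch of the quantity the coordinator wants to approximate in the static setting. It is a naive exact baseline in which every site's whole stream is merged, not the communication-efficient protocol of the paper. The names `frequency_moment` and `is_good_estimate` are hypothetical, and reading "up to a 1+ε factor" as a relative error of at most ε is an assumption.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def frequency_moment(streams: Sequence[Iterable[Hashable]], p: float) -> float:
    """Exact F_p over the union of the k site streams (naive baseline).

    f_i is the number of times element i occurs across all streams.
    F_0 is taken to be the number of distinct elements.
    """
    freq: Counter = Counter()
    for stream in streams:
        # Merging the streams models the union seen by the coordinator.
        freq.update(stream)
    if p == 0:
        return float(len(freq))
    return float(sum(f ** p for f in freq.values()))


def is_good_estimate(estimate: float, truth: float, eps: float) -> bool:
    """Assumed reading of a (1 + eps)-approximation: relative error <= eps."""
    return abs(estimate - truth) <= eps * truth


# Three sites (k = 3); element frequencies are a: 3, b: 2, c: 1.
sites = [["a", "b", "a"], ["b", "c"], ["a"]]
f0 = frequency_moment(sites, 0)  # distinct elements
f1 = frequency_moment(sites, 1)  # total stream length
f2 = frequency_moment(sites, 2)  # 3^2 + 2^2 + 1^2
```

Shipping every stream to the coordinator costs communication proportional to the total stream length; the bounds above concern how far below that cost an approximate answer can be obtained.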

[1]  W. Feller Generalization of a probability limit theorem of Cramér , 1943 .

[2]  Andrew Chi-Chih Yao,et al.  Probabilistic computations: Toward a unified measure of complexity , 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[3]  Bala Kalyanasundaram,et al.  The Probabilistic Communication Complexity of Set Intersection , 1992, SIAM J. Discret. Math.

[4]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[5]  Alexander A. Razborov,et al.  On the Distributional Complexity of Disjointness , 1992, Theor. Comput. Sci.

[6]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[7]  E. Kushilevitz,et al.  Communication Complexity: Basics , 1996 .

[8]  José D. P. Rolim,et al.  Lower Bounds on the Multiparty Communication Complexity , 1998, J. Comput. Syst. Sci.

[9]  Satish Kumar,et al.  Next century challenges: scalable coordination in sensor networks , 1999, MobiCom.

[10]  Andrew Chi-Chih Yao,et al.  Informational complexity and the direct sum problem for simultaneous message complexity , 2001, Proceedings 42nd IEEE Symposium on Foundations of Computer Science.

[11]  C. Papadimitriou,et al.  The complexity of massive data set computations , 2002 .

[12]  Ziv Bar-Yossef,et al.  An information statistics approach to data stream and communication complexity , 2002, The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings.

[13]  Christopher Olston,et al.  Distributed top-k monitoring , 2003, SIGMOD '03.

[14]  Subhash Khot,et al.  Near-optimal lower bounds on the multi-party communication complexity of set disjointness , 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings.

[15]  David P. Woodruff,et al.  Tight lower bounds for the distinct elements problem , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings.

[16]  Ravi Kumar,et al.  An information statistics approach to data stream and communication complexity , 2004, J. Comput. Syst. Sci.

[17]  David P. Woodruff Optimal space lower bounds for all frequency moments , 2004, SODA '04.

[18]  Noga Alon,et al.  The Probabilistic Method, Second Edition , 2004 .

[19]  Christopher Olston,et al.  Finding (recently) frequent items in distributed data streams , 2005, 21st International Conference on Data Engineering (ICDE'05).

[20]  Graham Cormode,et al.  Holistic aggregates in a networked world: distributed tracking of approximate quantiles , 2005, SIGMOD '05.

[21]  David P. Woodruff,et al.  Optimal approximations of the frequency moments of data streams , 2005, STOC '05.

[22]  Graham Cormode,et al.  Sketching Streams Through the Net: Distributed Approximate Query Tracking , 2005, VLDB.

[23]  Graham Cormode,et al.  Communication-efficient distributed monitoring of thresholded counts , 2006, SIGMOD Conference.

[24]  T. S. Jayram,et al.  Open Problems in Data Streams and Related Topics: IITK Workshop on Algorithms for Data Streams ’06 , 2007 .

[25]  Funda Ergün,et al.  On distance to monotonicity and longest increasing subsequence of a data stream , 2008, SODA '08.

[26]  T. S. Jayram,et al.  Tight lower bounds for selection in randomly ordered streams , 2008, SODA '08.

[27]  Graham Cormode,et al.  Algorithms for distributed functional monitoring , 2008, SODA '08.

[28]  Graham Cormode,et al.  Robust lower bounds for communication and stream computation , 2008, Theory Comput.

[29]  Krzysztof Onak,et al.  Sketching and Streaming Entropy via Approximation Theory , 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[30]  Chrisil Arackaparambil,et al.  Functional Monitoring without Monotonicity , 2009, ICALP.

[31]  André Gronemeier,et al.  Asymptotically Optimal Lower Bounds on the NIH-Multi-Party Information Complexity of the AND-Function and Disjointness , 2009, STACS.

[32]  Joshua Brody,et al.  A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences , 2009, 2009 24th Annual IEEE Conference on Computational Complexity.

[33]  T. S. Jayram Hellinger Strikes Back: A Note on the Multi-party Information Complexity of AND , 2009, APPROX-RANDOM.

[34]  Sudipto Guha,et al.  Revisiting the Direct Sum Theorem and Space Lower Bounds in Random Order Streams , 2009, ICALP.

[35]  Anna Gál,et al.  Lower Bounds on Streaming Algorithms for Approximating the Length of the Longest Increasing Subsequence , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[36]  Qin Zhang,et al.  Optimal sampling from distributed streams , 2010, PODS '10.

[37]  Joshua Brody,et al.  Better Gap-Hamming Lower Bounds via Better Round Elimination , 2010, APPROX-RANDOM.

[38]  Graham Cormode,et al.  Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[39]  Claire Mathieu,et al.  Recognizing well-parenthesized expressions in the streaming model , 2009, STOC '10.

[40]  David P. Woodruff,et al.  An optimal algorithm for the distinct elements problem , 2010, PODS '10.

[41]  Amit Chakrabarti,et al.  An Optimal Lower Bound on the Communication Complexity of Gap-Hamming-Distance , 2012, SIAM J. Comput.

[42]  Qin Zhang,et al.  Optimal Tracking of Distributed Heavy Hitters and Quantiles , 2011, Algorithmica.

[43]  David P. Woodruff,et al.  Open Problems in Data Streams, Property Testing, and Related Topics , 2011 .

[44]  Thomas Vidick,et al.  A concentration inequality for the overlap of a vector on a large set, with application to the communication complexity of the Gap-Hamming-Distance problem , 2011, Chic. J. Theor. Comput. Sci.

[45]  Sumit Ganguly,et al.  Polynomial Estimators for High Frequency Moments , 2011, ArXiv.

[46]  Alexander A. Sherstov The Communication Complexity of Gap Hamming Distance , 2012, Theory Comput.

[47]  David P. Woodruff,et al.  Fast moment estimation in data streams in optimal space , 2010, STOC '11.

[48]  David P. Woodruff,et al.  Optimal Random Sampling from Distributed Streams Revisited , 2011, DISC.

[49]  Omer Reingold,et al.  Pseudorandom generators for combinatorial shapes , 2011, STOC '11.

[50]  Amit Chakrabarti,et al.  Information Complexity versus Corruption and Applications to Orthogonality and Gap-Hamming , 2012, APPROX-RANDOM.

[51]  Qin Zhang,et al.  Randomized algorithms for tracking distributed count, frequencies, and ranks , 2012, PODS '12.

[52]  Sumit Ganguly A Lower Bound for Estimating High Moments of a Data Stream , 2012, ArXiv.

[53]  Xi Chen,et al.  How to Compress Interactive Communication , 2013, SIAM J. Comput.