The Coin Problem with Applications to Data Streams

Consider the problem of computing the majority of a stream of $n$ i.i.d. uniformly random bits. This problem, known as the coin problem, is central to a number of counting problems in different data stream models. We show that any streaming algorithm for solving this problem with large constant advantage must use $\Omega(\log n)$ bits of space. We extend this lower bound to prove tight lower bounds for solving multiple, randomly interleaved copies of the coin problem, as well as for solving the OR of multiple copies of a variant of the coin problem. Our proofs involve new measures of information complexity that are well-suited for data streams. We use these lower bounds to obtain a number of new results for data streams. In each case there is an underlying $d$-dimensional vector $x$ with additive updates to its coordinates given in a stream of length $m$. The input streams arising from our coin lower bound have nice distributional properties. Consequently, for many problems for which lower bounds were previously known only in general turnstile streams, we now obtain the same lower bounds in more natural models: the bounded deletion model, in which $\Vert x\Vert_{2}$ never drops by a constant fraction of what it was earlier, and the random order model, in which the updates are ordered randomly. In particular, in the bounded deletion model, we obtain nearly tight lower bounds for approximating $\Vert x\Vert_{\infty}$ up to additive error $\frac{1}{\sqrt{k}}\Vert x\Vert_{2}$, approximating $\Vert x\Vert_{2}$ up to a multiplicative $(1+\epsilon)$ factor (resolving a question of Jayaram and Woodruff in PODS 2018), and solving the Point Query and $\ell_{2}$-Heavy Hitters Problems. In the random order model, we also obtain new lower bounds for the Point Query and $\ell_{2}$-Heavy Hitters Problems.
We also give new algorithms complementing our lower bounds and illustrating the tightness of the models we consider, including an algorithm for approximating $\Vert x\Vert_{\infty}$ up to additive error $\frac{1}{\sqrt{k}}\Vert x\Vert_{2}$ in turnstile streams (resolving a question of Cormode in a 2006 IITK Workshop), and an algorithm for finding $\ell_{2}$-heavy hitters in randomly ordered insertion streams (which, for random order streams, resolves a question of Nelson in a 2018 Warwick Workshop).
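
To make the coin problem concrete, the following minimal sketch shows the straightforward counting approach, which matches the $\Omega(\log n)$ lower bound up to constants: a single signed counter tracks the number of ones minus the number of zeros, and since that value always lies in $[-n, n]$ it fits in $O(\log n)$ bits. This is an illustration only, not an algorithm from the paper; the function names `majority_by_counting` and `random_bit_stream`, and the choice to break ties toward 0, are assumptions made for the example.

```python
import random


def majority_by_counting(bits):
    """Return 1 if the stream has strictly more ones than zeros, else 0.

    Ties are broken toward 0 (an arbitrary choice for this sketch).
    """
    # balance = (#ones - #zeros) seen so far. It always lies in [-n, n],
    # so it can be stored in O(log n) bits.
    balance = 0
    for b in bits:
        balance += 1 if b else -1
    return 1 if balance > 0 else 0


def random_bit_stream(n, seed=None):
    """Yield n i.i.d. uniformly random bits (hypothetical helper for the demo)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.getrandbits(1)


if __name__ == "__main__":
    # One pass over the stream; only the counter is kept in memory.
    print(majority_by_counting(random_bit_stream(1001, seed=0)))
```

The stream is consumed in a single pass and only the counter is retained, which is the sense in which the space usage is measured.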
