The average-case complexity of counting distinct elements

We continue the study of approximating the number of distinct elements in a data stream of length n to within a (1 ± ε) factor. It is known that if the stream may consist of arbitrary data arriving in an arbitrary order, then any 1-pass algorithm requires Ω(1/ε²) bits of space to perform this task. To try to bypass this lower bound, the problem was recently studied in a model in which the stream may consist of arbitrary data, but the data arrives in a random order. However, even in this model an Ω(1/ε²) lower bound was established, because the adversary can still choose the data arbitrarily. This leaves open the possibility that the problem is hard only under a pathological choice of data, which would be of little practical relevance. We study the average-case complexity of this problem under certain distributions. Namely, we study the case in which each successive stream item is drawn independently and uniformly at random from an unknown subset of d items, for an unknown value of d. This captures the notion of random uncorrelated data. For a wide range of values of d and n, we design a 1-pass algorithm that bypasses the Ω(1/ε²) lower bound that holds in the adversarial and random-order models, thereby showing that this model admits more space-efficient algorithms. Moreover, the update time of our algorithm is optimal. Despite these positive results, for a certain range of values of d and n we show that estimating the number of distinct elements requires Ω(1/ε²) bits of space even in this model. Our lower bound subsumes previous bounds, showing that even for natural choices of data the problem is hard.
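To make the random-data model concrete, the following is a minimal sketch that simulates it; it is not the algorithm of the paper. It draws n items independently and uniformly from a support of size d, counts the distinct items exactly (using space linear in d, which the streaming setting does not allow), and compares the count with the standard expectation d(1 − (1 − 1/d)^n). The function names, the choice of the integers 0..d−1 as the support, and the parameters `trials` and `seed` are assumptions made for illustration.

```python
import random


def expected_distinct(d: int, n: int) -> float:
    """Expected number of distinct items after n uniform draws from d items.

    Each fixed item is missed by all n draws with probability (1 - 1/d)^n,
    so by linearity of expectation the mean is d * (1 - (1 - 1/d)^n).
    """
    return d * (1.0 - (1.0 - 1.0 / d) ** n)


def simulate_distinct(d: int, n: int, rng: random.Random) -> int:
    """Draw n items uniformly from a d-item support and count distinct ones.

    The support is taken to be 0..d-1 (an illustrative assumption); the
    count is exact and therefore stores up to d items.
    """
    seen = set()
    for _ in range(n):
        seen.add(rng.randrange(d))
    return len(seen)


def average_simulated(d: int, n: int, trials: int, seed: int = 0) -> float:
    """Average the exact distinct count over several independent streams."""
    rng = random.Random(seed)
    total = sum(simulate_distinct(d, n, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    d, n = 100, 100
    print("expected :", expected_distinct(d, n))
    print("simulated:", average_simulated(d, n, trials=200))
```

When n is much smaller than d, nearly every draw is new and the count is close to n; when n is much larger than d, the count saturates at d. The relationship between d and n therefore determines how informative the stream is, which is consistent with the abstract stating results for particular ranges of d and n.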
