Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem

We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least 1/n. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length n. For both problems, we prove a lower bound on the query complexity that is nearly linear in n and applies even to approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, X<sub>1</sub> and X<sub>2</sub>, with very different expectations and the following condition on the first k moments: E[X<sub>1</sub>]/E[X<sub>2</sub>] = E[X<sub>1</sub><sup>2</sup>]/E[X<sub>2</sub><sup>2</sup>] = ... = E[X<sub>1</sub><sup>k</sup>]/E[X<sub>2</sub><sup>k</sup>]. Our lower bound method is also applicable to other problems. In particular, it gives new lower bounds for the sample complexity of (1) approximating the entropy of a distribution and (2) approximating how well a given string is compressed by the Lempel-Ziv scheme.
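To make the problem setting concrete, the following is a minimal sketch and not the construction from the paper. It draws a small number of samples from two uniform distributions whose support sizes differ by a factor of 10, both satisfying the condition that every element has probability at least 1/n, and counts the distinct values seen. The function names `uniform_sampler` and `distinct_in_sample`, the choice of uniform distributions, and the parameter values are assumptions made purely for illustration.

```python
import random

def uniform_sampler(support_size, rng):
    """Return a function that draws from the uniform distribution on {0, ..., support_size - 1}."""
    return lambda: rng.randrange(support_size)

def distinct_in_sample(draw, num_samples):
    """Naive plug-in estimate: the number of distinct values among num_samples draws."""
    return len({draw() for _ in range(num_samples)})

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 10_000              # every element has probability at least 1/n
num_samples = 100       # far fewer samples than n

# Support sizes n/10 and n: element probabilities are 10/n and 1/n respectively.
small = distinct_in_sample(uniform_sampler(n // 10, rng), num_samples)
large = distinct_in_sample(uniform_sampler(n, rng), num_samples)

print("distinct values seen (support n/10):", small)
print("distinct values seen (support n):   ", large)
```

With so few samples, the distinct count is capped by the number of samples in both cases, so this naive estimator cannot separate the two support sizes. The sketch only illustrates why many samples are needed in this one simple case; the lower bound in the abstract is a much stronger statement that applies to every algorithm.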
