Estimating the unseen: A sublinear-sample canonical estimator of distributions

We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample additive estimators for a class of properties that includes entropy and distribution support size. Together with the lower bounds proven in the companion paper [29], this settles the longstanding question of the sample complexities of these estimation problems (up to constant factors). Our algorithm estimates these properties up to an arbitrarily small additive constant using O(n/log n) samples; [29] shows that no algorithm using o(n/log n) samples can achieve this. Here n is a bound on the support size or, in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain. Previously, no explicit sublinear-sample algorithms for either of these problems were known. Additionally, our algorithm runs in time linear in the number of samples used.

Think not, because no man sees, Such things will remain unseen. –Henry Wadsworth Longfellow, from “The Builders”.
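To make the two properties named in the abstract concrete, the following is a minimal sketch of the naive empirical ("plug-in") estimates of entropy and support size computed from a sample. This is not the estimator introduced in the paper; it is only the straightforward baseline, and the function names `plugin_entropy` and `observed_support_size` are hypothetical names chosen for illustration. The observed support size can never exceed the number of samples drawn, which is one way to see why sublinear-sample estimation needs to account for the unseen portion of the distribution.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Entropy (in nats) of the empirical distribution of `samples`.

    Naive baseline only: it treats observed frequencies as true
    probabilities and ignores elements that were never observed.
    """
    k = len(samples)
    counts = Counter(samples)
    return -sum((c / k) * math.log(c / k) for c in counts.values())


def observed_support_size(samples):
    """Number of distinct elements seen; at most len(samples)."""
    return len(set(samples))


# Small illustrative sample (made-up data).
sample = ["a", "b", "a", "c"]
entropy_estimate = plugin_entropy(sample)
support_estimate = observed_support_size(sample)
```

For this sample the empirical probabilities are 1/2, 1/4 and 1/4, so the plug-in entropy equals 1.5 ln 2 nats and the observed support size is 3.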

[1] Narendra Karmarkar, et al. A new polynomial-time algorithm for linear programming, 1984, Comb.

[2] Felix Schlenk, et al. Proof of Theorem 3, 2005.

[3] Ronitt Rubinfeld, et al. Testing that distributions are close, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[4] Liam Paninski, et al. Estimation of Entropy and Mutual Information, 2003, Neural Computation.

[5] Alon Orlitsky, et al. On Modeling Profiles Instead of Values, 2004, UAI.

[6] P. Glynn. Upper bounds on Poisson tail probabilities, 1987.

[7] David A. McAllester, et al. On the Convergence Rate of Good-Turing Estimators, 2000, COLT.

[8] Ravi Kumar, et al. Sampling algorithms: lower bounds and applications, 2001, STOC '01.

[9] R. Fisher, et al. The Relation Between the Number of Species and the Number of Individuals in a Random Sample of an Animal Population, 1943.

[10] Sanjeev R. Kulkarni, et al. A Better Good-Turing Estimator for Sequence Probabilities, 2007, 2007 IEEE International Symposium on Information Theory.

[11] Krzysztof Onak, et al. Sketching and Streaming Entropy via Approximation Theory, 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[12] Peter J. Haas, et al. On synopses for distinct-value estimation under multiset operations, 2007, SIGMOD '07.

[13] David P. Woodruff. The average-case complexity of counting distinct elements, 2009, ICDT '09.

[14] Liam Paninski, et al. Estimating entropy on m bins given fewer than m samples, 2004, IEEE Transactions on Information Theory.

[15] I. Good. The Population Frequencies of Species and the Estimation of Population Parameters, 1953.

[16] Graham Cormode, et al. A near-optimal algorithm for computing the entropy of a stream, 2007, SODA '07.

[17] Sudipto Guha, et al. Streaming and sublinear approximation of entropy and information distances, 2005, SODA '06.

[18] Rajeev Motwani, et al. Towards estimation error guarantees for distinct values, 2000, PODS.

[19] David P. Woodruff, et al. An optimal algorithm for the distinct elements problem, 2010, PODS '10.

[20] Alon Orlitsky, et al. Always Good Turing: Asymptotically Optimal Probability Estimation, 2003, Science.

[21] Luca Trevisan, et al. Counting Distinct Elements in a Data Stream, 2002, RANDOM.

[22] Paul Valiant. Testing symmetric properties of distributions, 2008, STOC '08.

[23] Noga Alon, et al. The Space Complexity of Approximating the Frequency Moments, 1999.

[24] Gregory Valiant, et al. A CLT and tight lower bounds for estimating entropy, 2010, Electron. Colloquium Comput. Complex.

[25] Dana Ron, et al. Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem, 2009, SIAM J. Comput.

[26] Ronitt Rubinfeld, et al. The complexity of approximating the entropy, 2002, Proceedings 17th IEEE Annual Conference on Computational Complexity.

[27] Alon Orlitsky, et al. The maximum likelihood probability of unique-singleton, ternary, and length-7 patterns, 2009, 2009 IEEE International Symposium on Information Theory.

[28] David P. Woodruff, et al. Tight lower bounds for the distinct elements problem, 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings.

[29] Sanjeev R. Kulkarni, et al. Strong Consistency of the Good-Turing Estimator, 2006, 2006 IEEE International Symposium on Information Theory.

[30] J. Bunge, et al. Estimating the Number of Species: A Review, 1993.

[31] Tugkan Batu. Testing Properties of Distributions, 2001.