Estimating the number of unseen species: A bird in the hand is worth $\log n$ in the bush

Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher, uses $n$ samples to predict the number $U$ of hitherto unseen species that would be observed if $t\cdot n$ new samples were collected. Of considerable interest is the largest ratio $t$ between the number of new and existing samples for which $U$ can be accurately predicted. In seminal works, Good and Toulmin constructed an intriguing estimator that predicts $U$ for all $t\le 1$, thereby showing that the number of species can be estimated for a population twice as large as that observed. Subsequently, Efron and Thisted obtained a modified estimator that empirically predicts $U$ even for some $t>1$, but without provable guarantees. We derive a class of estimators that $\textit{provably}$ predict $U$ not just for constant $t>1$, but all the way up to $t$ proportional to $\log n$. This shows that the number of species can be estimated for a population $\log n$ times larger than that observed, a factor that grows arbitrarily large as $n$ increases. We also show that this range is the best possible and that the estimators' mean-square error is optimal up to constants for any $t$. Our approach yields the first provable guarantee for the Efron-Thisted estimator and, in addition, a variant that achieves stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators we derive are simple linear estimators that are computable in time proportional to $n$. The performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.
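As a rough illustration of what a "simple linear estimator" looks like here, the sketch below implements the classical Good-Toulmin estimator, $-\sum_{i\ge 1}(-t)^i\,\Phi_i$, where $\Phi_i$ is the number of species seen exactly $i$ times in the $n$ samples. The abstract itself gives no formulas, so the second function is only an assumed form of a smoothed variant: it damps the $i$-th term by the tail probability $P(L\ge i)$ of a Poisson variable $L$ with rate `r`. The function names and the choice of `r` are hypothetical and are not taken from the text above.

```python
from collections import Counter
import math


def prevalences(samples):
    """Map i -> number of species observed exactly i times (one pass over the samples)."""
    species_counts = Counter(samples)
    return Counter(species_counts.values())


def good_toulmin(samples, t):
    """Classical Good-Toulmin estimate of new species in t*n further samples.

    Reliable only for t <= 1; for t > 1 the terms (-t)**i grow geometrically.
    """
    phi = prevalences(samples)
    return -sum((-t) ** i * phi_i for i, phi_i in phi.items())


def poisson_tail(r, i):
    """P(L >= i) for L ~ Poisson(r)."""
    return 1.0 - sum(math.exp(-r) * r ** j / math.factorial(j) for j in range(i))


def smoothed_good_toulmin(samples, t, r):
    """Assumed smoothed variant: weight the i-th term by P(L >= i), L ~ Poisson(r).

    The rate r is a hypothetical tuning parameter; smaller r truncates the
    series earlier and tames the high-order terms when t > 1.
    """
    phi = prevalences(samples)
    return -sum((-t) ** i * poisson_tail(r, i) * phi_i for i, phi_i in phi.items())
```

For example, with samples `["a", "a", "b", "c", "c", "c", "d"]` the prevalences are $\Phi_1 = 2$, $\Phi_2 = 1$, $\Phi_3 = 1$, so `good_toulmin(samples, 1.0)` evaluates to $2 - 1 + 1 = 2$. Both functions are weighted sums of the prevalences, which is why they run in time proportional to $n$.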
