Entropy and the species accumulation curve: a novel entropy estimator via discovery rates of new species

Estimating Shannon entropy and its exponential from incomplete samples is a central objective in many research fields. However, empirical estimates of Shannon entropy and its exponential depend strongly on sample size and typically exhibit substantial bias. This work uses a novel method to obtain an accurate, low‐bias analytic estimator of entropy based on species frequency counts. Our estimator does not require prior knowledge of the number of species. We show that there is a close relationship between Shannon entropy and the species accumulation curve, which depicts the cumulative number of observed species as a function of sample size. We reformulate entropy in terms of the expected discovery rates of new species with respect to sample size, that is, the successive slopes of the species accumulation curve. Our estimator is obtained by applying slope estimators derived from an improved Good‐Turing frequency formula. We also apply our method to estimate mutual information. Extensive simulations based on theoretical models and real surveys show that, provided the sample size is not unreasonably small, the resulting entropy estimator is nearly unbiased. Our estimator generally outperforms previous methods in terms of bias and accuracy (low mean squared error), especially when species richness is large and a large fraction of species remains undetected in the samples. We discuss the extension of our approach to estimating Shannon entropy for multiple incidence data. We also discuss the use of our estimator in constructing an integrated rarefaction and extrapolation curve of entropy (or mutual information) as a function of sample size or sample coverage (an aspect of sample completeness).
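The link between entropy and the slopes of the species accumulation curve can be checked numerically when the true relative abundances are known. The sketch below is illustrative only and is not the estimator proposed in this work. It assumes sampling with replacement, under which the expected discovery rate at sample size k is the sum over species of p_i (1 - p_i)^k, and it relies on the series identity -log(p) = sum over k >= 1 of (1 - p)^k / k. The function names, the example abundance vector, and the truncation point `k_max` are hypothetical choices made for illustration.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (natural log) of a known abundance vector p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def expected_slope(p, k):
    """Expected slope of the accumulation curve between sizes k and k + 1.

    Assumes sampling with replacement: the (k + 1)-th individual belongs
    to a previously unseen species with probability sum_i p_i (1 - p_i)^k.
    """
    return sum(pi * (1.0 - pi) ** k for pi in p)


def entropy_from_slopes(p, k_max=500):
    """Truncated series sum_{k=1}^{k_max} slope(k) / k.

    k_max is an illustrative truncation point; rarer species need a
    larger value for the series to converge.
    """
    return sum(expected_slope(p, k) / k for k in range(1, k_max + 1))


# Hypothetical four-species assemblage with known relative abundances.
p = [0.5, 0.25, 0.125, 0.125]
print(shannon_entropy(p), entropy_from_slopes(p))
```

With known abundances the two quantities agree up to truncation error. In practice the abundances are unknown, so the slopes themselves have to be estimated from the sample, which is the step the abstract attributes to an improved Good‐Turing frequency formula.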
