Adaptive estimation of Shannon entropy

We consider estimating the Shannon entropy of a discrete distribution P from n i.i.d. samples. Recently, Jiao, Venkat, Han, and Weissman (JVHW), and Wu and Yang constructed approximation-theoretic estimators that achieve the minimax L2 rates for entropy estimation. Their estimators are consistent given n ≫ S/ln S samples, where S is the support size, and this sample complexity is best possible. In contrast, the Maximum Likelihood Estimator (MLE), i.e., the empirical entropy, requires n ≫ S samples. In the present paper we significantly refine these minimax results. To alleviate the pessimism of minimaxity, we adopt the adaptive estimation framework and show that the JVHW estimator is adaptive: it achieves the minimax rates simultaneously over a nested sequence of subsets of distributions P, without knowing the support size S or which subset P lies in. We also characterize the maximum risk of the MLE over this nested sequence and show, for every subset in the sequence, that the performance of the minimax rate-optimal estimator with n samples is essentially that of the MLE with n ln n samples, further substantiating the generality of the "effective sample size enlargement" phenomenon discovered by Jiao, Venkat, Han, and Weissman. We also provide a "pointwise" explanation of this phenomenon: for sufficiently small probabilities, the bias function of the JVHW estimator with n samples nearly coincides with that of the MLE with n ln n samples.
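To make the two estimators discussed above concrete, the sketch below contrasts the plug-in MLE with a simplified two-regime estimator in the spirit of the JVHW construction: an (approximately) unbiased polynomial estimate of −p ln p for symbols with small counts, and a bias-corrected plug-in estimate for symbols with large counts. This is a minimal illustration, not the authors' released implementation: the constants c1 and c2, the degree choice, the least-squares fit (standing in for the best uniform polynomial approximation), and the function names are all assumptions made for this example.

```python
# Minimal sketch (NOT the authors' released implementation) contrasting the
# plug-in MLE with a simplified JVHW-style two-regime estimator.  The
# constants c1, c2, the degree choice, and the least-squares fit are
# illustrative assumptions standing in for the paper's tuned minimax
# polynomial approximation; function names here are hypothetical.
import numpy as np


def _falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); the empty product 1 when k = 0."""
    out = 1.0
    for i in range(k):
        out *= x - i
    return out


def mle_entropy(counts):
    """Plug-in (empirical) entropy in nats -- the MLE discussed above."""
    counts = np.asarray(counts)
    n = counts.sum()
    p = counts[counts > 0] / n
    return float(-(p * np.log(p)).sum())


def jvhw_style_entropy(counts, c1=4.0, c2=2.0):
    """Two-regime sketch: polynomial estimates for small counts,
    bias-corrected plug-in for large counts.  Sums over observed
    symbols only, since the support size S is unknown."""
    counts = np.asarray(counts)
    n = int(counts.sum())
    logn = np.log(n)
    degree = max(2, int(round(logn)))      # approximation degree grows like ln n
    delta = min(1.0, c1 * logn / n)        # "small probability" region [0, delta]

    # Least-squares polynomial fit to f(p) = -p ln p on [0, delta], rescaled to
    # t = p/delta in [0, 1] for conditioning.  A crude stand-in for the best
    # uniform (minimax) approximation used in the actual construction.
    t = np.linspace(0.0, 1.0, 2000)
    p_grid = delta * t
    f = np.zeros_like(p_grid)
    f[1:] = -p_grid[1:] * np.log(p_grid[1:])          # f(0) = 0 by continuity
    V = np.vander(t, degree + 1, increasing=True)
    b, *_ = np.linalg.lstsq(V, f, rcond=None)
    a = b / delta ** np.arange(degree + 1)            # monomial coefficients in p

    est = 0.0
    for x in counts[counts > 0]:
        if x <= c2 * logn:
            # Small count: unbiasedly estimate each monomial, using
            # E[(X)_k / (n)_k] = p^k for X ~ Binomial(n, p).
            est += sum(a[k] * _falling(x, k) / _falling(n, k)
                       for k in range(degree + 1))
        else:
            # Large count: plug-in with a first-order bias correction.
            p_hat = x / n
            est += -p_hat * np.log(p_hat) + (1.0 - p_hat) / (2.0 * n)
    return est


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, n = 10_000, 10_000                  # n comparable to S: MLE is biased down
    p = rng.dirichlet(np.ones(S))
    counts = np.bincount(rng.choice(S, size=n, p=p), minlength=S)
    print(f"true    : {-(p * np.log(p)).sum():.4f}")
    print(f"MLE     : {mle_entropy(counts):.4f}")
    print(f"JVHW-ish: {jvhw_style_entropy(counts):.4f}")
```

In the regime n ≍ S illustrated in the demo, the plug-in estimate is biased downward by roughly (S − 1)/(2n) nats to first order, which is precisely the gap that the small-count polynomial regime is designed to close.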

[1] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 1948.

[2] J. Hájek, "A characterization of limiting distributions of regular estimates," 1970.

[3] J. Hájek, "Local asymptotic minimax and admissibility in estimation," 1972.

[4] B. Efron et al., "Estimating the number of unseen species: How many words did Shakespeare know?" Biometrika, vol. 63, 1976.

[5] A. Timan et al., "Mathematical expectation of continuous functions of random variables. Smoothness and variance," 1977.

[6] V. Totik et al., Moduli of Smoothness, 1987.

[7] P. Petrushev et al., Rational Approximation of Real Functions, 1988.

[8] G. G. Lorentz et al., Constructive Approximation, Grundlehren der mathematischen Wissenschaften, 1993.

[9] T. Sejnowski et al., "Reliability of spike timing in neocortical neurons," Science, 1995.

[10] G. D. Lewen et al., "Reproducibility and variability in neural spike trains," Science, 1997.

[11] M. J. Berry et al., "The structure and precision of retinal spike trains," Proceedings of the National Academy of Sciences, 1997.

[12] A. J. Wyner and D. Foster, "On the lower limits of entropy estimation," 2003.

[13] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, 2003.

[14] E. Upfal et al., Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.

[15] P. McCullagh, "Estimating the number of unseen species: How many words did Shakespeare know?" 2008.

[16] A. B. Tsybakov, Introduction to Nonparametric Estimation, Springer Series in Statistics, 2008.

[17] V. A. F. Almeida et al., "Characterizing user behavior in online social networks," IMC '09, 2009.

[18] G. Valiant et al., "Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs," STOC '11, 2011.

[19] T. Cai, "Minimax and adaptive inference in nonparametric function estimation," 2012, arXiv:1203.4911.

[20] G. Lugosi et al., Concentration Inequalities: A Nonasymptotic Theory of Independence, 2013.

[21] Y. Han et al., "Minimax estimation of discrete distributions under ℓ1 loss," arXiv, 2014.

[22] Y. Han et al., "Beyond maximum likelihood: From theory to practice," arXiv, 2014.

[23] T. Weissman, "Non-asymptotic theory for the plug-in rule in functional estimation," 2014.

[24] H. Tyagi et al., "The complexity of estimating Rényi entropy," SODA, 2015.

[25] Y. Han et al., "Minimax estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, 2014.

[26] Y. Han et al., "Minimax estimation of discrete distributions under ℓ1 loss," IEEE Transactions on Information Theory, 2014.

[27] Y. Han et al., "Minimax estimation of the L1 distance," 2016 IEEE International Symposium on Information Theory (ISIT), 2016.

[28] Y. Han et al., "Beyond maximum likelihood: Boosting the Chow-Liu algorithm for large alphabets," 2016 50th Asilomar Conference on Signals, Systems and Computers, 2016.

[29] A. Suresh et al., "Optimal prediction of the number of unseen species," Proceedings of the National Academy of Sciences, 2016.

[30] Y. Han et al., "Minimax rate-optimal estimation of KL divergence between discrete distributions," 2016 International Symposium on Information Theory and Its Applications (ISITA), 2016.

[31] Y. Wu et al., "Sample complexity of the distinct elements problem," arXiv:1612.03375, 2016.

[32] Y. Wu et al., "Minimax rates of entropy estimation on large alphabets via best polynomial approximation," IEEE Transactions on Information Theory, 2014.

[33] Y. Han et al., "Maximum likelihood estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, 2014.

[34] Y. Wu et al., "Chebyshev polynomials, moment matching, and optimal estimation of the unseen," The Annals of Statistics, 2015.