Maximum Likelihood Estimation of Functionals of Discrete Distributions

We consider the problem of estimating functionals of discrete distributions, and focus on a tight (up to universal multiplicative constants for each specific functional) nonasymptotic analysis of the worst case squared error risk of widely used estimators. We apply concentration inequalities to analyze the random fluctuation of these estimators around their expectations, and the theory of approximation using positive linear operators to analyze the deviation of their expectations from the true functional, namely their bias. We explicitly characterize the worst case squared error risk incurred by the maximum likelihood estimator (MLE) in estimating the Shannon entropy $H(P) = \sum_{i=1}^{S} -p_i \ln p_i$ and the power sum $F_\alpha(P) = \sum_{i=1}^{S} p_i^\alpha$, $\alpha > 0$, up to universal multiplicative constants for each fixed functional, for any alphabet size $S \leq \infty$ and sample size $n$ for which the risk may vanish. As a corollary, for Shannon entropy estimation, we show that $n \gg S$ observations are necessary and sufficient for the MLE to be consistent. In addition, we establish that $n \gg S^{1/\alpha}$ samples are necessary and sufficient for the MLE to consistently estimate $F_\alpha(P)$, $0 < \alpha < 1$. The minimax rate-optimal estimators for these two problems require $S/\ln S$ and $S^{1/\alpha}/\ln S$ samples, respectively, which implies that the MLE has strictly suboptimal sample complexity. When $1 < \alpha < 3/2$, we show that the worst case squared error rate of convergence of the MLE is $n^{-2(\alpha-1)}$ for infinite alphabet size, while the minimax squared error rate is $(n \ln n)^{-2(\alpha-1)}$. When $\alpha \geq 3/2$, the MLE achieves the minimax optimal rate $n^{-1}$ regardless of the alphabet size. As an application of the general theory, we analyze Dirichlet prior smoothing techniques for Shannon entropy estimation. In this context, one approach is to plug the Dirichlet prior smoothed distribution into the entropy functional, while the other is to compute the Bayes estimator of entropy under the Dirichlet prior and squared error loss, which is the conditional expectation. We show that in general such estimators do not improve over the maximum likelihood estimator: no matter how the parameters of the Dirichlet prior are tuned, this approach cannot achieve the minimax rates in entropy estimation. The performance of the minimax rate-optimal estimator with $n$ samples is essentially at least as good as that of the Dirichlet smoothed entropy estimators with $n \ln n$ samples.
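For concreteness, below is a minimal Python sketch of the estimators discussed above: the plug-in (MLE) estimates of $H(P)$ and $F_\alpha(P)$ computed from empirical frequencies, and the Dirichlet-smoothed plug-in variant that inserts the Dirichlet posterior mean distribution into the entropy functional. The function names, the pseudocount parameter `a`, and the toy usage at the end are illustrative assumptions, not part of the paper.

```python
import numpy as np

def empirical_distribution(samples, alphabet_size):
    """Maximum likelihood (empirical frequency) estimate of the pmf from i.i.d. samples."""
    counts = np.bincount(samples, minlength=alphabet_size)
    return counts / counts.sum()

def entropy_plugin(p):
    """Plug-in estimate of the Shannon entropy H(P) = sum_i -p_i ln p_i."""
    p = p[p > 0]  # adopt the convention 0 * ln 0 = 0
    return float(-np.sum(p * np.log(p)))

def power_sum_plugin(p, alpha):
    """Plug-in estimate of the power sum F_alpha(P) = sum_i p_i^alpha, alpha > 0."""
    p = p[p > 0]
    return float(np.sum(p ** alpha))

def entropy_dirichlet_smoothed(samples, alphabet_size, a):
    """Plug the Dirichlet(a, ..., a) posterior mean distribution into the entropy
    functional, i.e. add pseudocount a to every symbol before normalizing."""
    counts = np.bincount(samples, minlength=alphabet_size).astype(float)
    n = counts.sum()
    p_smoothed = (counts + a) / (n + a * alphabet_size)
    return entropy_plugin(p_smoothed)

# Toy usage (hypothetical): draw n samples from a random distribution on S symbols.
rng = np.random.default_rng(0)
S, n = 1000, 5000
P = rng.dirichlet(np.ones(S))
x = rng.choice(S, size=n, p=P)
p_hat = empirical_distribution(x, S)
print(entropy_plugin(p_hat), power_sum_plugin(p_hat, 0.5), entropy_dirichlet_smoothed(x, S, 1.0))
```

The sketch only implements the estimators whose risk is analyzed; it does not reproduce the minimax rate-optimal (polynomial approximation based) estimators mentioned in the abstract.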
