Minimax rate-optimal estimation of KL divergence between discrete distributions

We refine the general methodology in [1] for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite-dimensional parameters, and elaborate on the case of discrete distributions with support size S comparable to the number of observations n. Specifically, we determine the "smooth" and "non-smooth" regimes based on the confidence set and the smoothness of the functional. In the "non-smooth" regime, we construct an unbiased estimator of a suitable polynomial approximation of the functional. In the "smooth" regime, we construct a bias-corrected version of the Maximum Likelihood Estimator (MLE) based on a Taylor expansion. We apply the general methodology to the problem of estimating the KL divergence between two discrete distributions from empirical data. We construct a minimax rate-optimal estimator which is adaptive in the sense that it requires knowledge of neither the support size nor an upper bound on the likelihood ratio. Moreover, the performance of the optimal estimator with n samples is essentially that of the MLE with n ln n samples, i.e., the effective sample size enlargement phenomenon holds.
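To make the two-regime construction concrete, the sketch below applies it to the single-symbol functional f(p) = -p log p (the building block of entropy), which is simpler than KL divergence but exercises the same machinery: symbols whose empirical frequency falls below a threshold of order (log n)/n are handled by an unbiased estimator of a polynomial approximation of f, and the remaining symbols by the Taylor-bias-corrected plug-in. This is a minimal illustration, not the estimator analyzed in the paper: the constants c0 and c1, the least-squares fit used in place of the best uniform (minimax) polynomial approximation, and the omission of the sample-splitting step are all simplifying assumptions.

```python
import numpy as np
from math import lgamma, log

def falling_factorial_ratio(x, n, k):
    """Unbiased estimator of p**k from a Binomial(n, p) count x:
    x (x-1) ... (x-k+1) / [n (n-1) ... (n-k+1)]."""
    if k == 0:
        return 1.0
    if x < k:
        return 0.0
    return np.exp(lgamma(x + 1) - lgamma(x - k + 1)
                  - lgamma(n + 1) + lgamma(n - k + 1))

def poly_coeffs(threshold, degree):
    """Coefficients a_0, ..., a_degree of a polynomial approximating
    -p*log(p) on [0, threshold].  A least-squares fit in the rescaled
    variable t = p/threshold stands in for the best uniform approximation
    that the analysis actually requires."""
    t = np.linspace(1e-9, 1.0, 2000)
    p = t * threshold
    b = np.polyfit(t, -p * np.log(p), degree)[::-1]   # b_0, ..., b_degree in t
    return np.array([bk / threshold**k for k, bk in enumerate(b)])

def entropy_estimate(counts, c0=0.5, c1=2.0):
    """Two-regime estimator of H(P) = sum_i -p_i log p_i from a count vector.
    c0 (polynomial degree) and c1 (regime threshold) are illustrative constants."""
    counts = np.asarray(counts)
    n = int(counts.sum())
    threshold = c1 * log(n) / n                # "smooth" vs. "non-smooth" boundary
    degree = max(1, int(c0 * log(n)))          # approximation degree grows like log n
    a = poly_coeffs(threshold, degree)
    est = 0.0
    for x in counts:
        p_hat = x / n
        if p_hat < threshold:
            # non-smooth regime: unbiased estimate of the polynomial approximation
            est += sum(ak * falling_factorial_ratio(x, n, k) for k, ak in enumerate(a))
        else:
            # smooth regime: plug-in (MLE) plus the first-order Taylor bias correction
            est += -p_hat * log(p_hat) + (1.0 - p_hat) / (2.0 * n)
    return est

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = 2000                                   # alphabet size comparable to the sample size
    p = np.full(S, 1.0 / S)
    counts = rng.multinomial(5000, p)
    print("estimate:", entropy_estimate(counts))
    print("truth:   ", -np.sum(p * np.log(p)))
```

The KL divergence estimator of the paper follows the same skeleton, with each symbol classified by its empirical frequencies under both distributions; the sketch above only illustrates the regime split and the two per-regime estimators.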

[1] Yanjun Han, et al. Minimax Estimation of KL Divergence between Discrete Distributions, 2016, arXiv.

[2] T. Cai, et al. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional, 2011, arXiv:1105.3039.

[3] Sanjeev R. Kulkarni, et al. Universal Divergence Estimation for Finite-Alphabet Sources, 2006, IEEE Transactions on Information Theory.

[4] A. Suresh, et al. Optimal prediction of the number of unseen species, 2016, Proceedings of the National Academy of Sciences.

[5] C. Withers. Bias reduction by Taylor series, 1987.

[6] Gregory Valiant, et al. The Power of Linear Estimators, 2011, IEEE 52nd Annual Symposium on Foundations of Computer Science.

[7] Moritz Hardt, et al. Tight Bounds for Learning a Mixture of Two Gaussians, 2014, STOC.

[8] Murad S. Taqqu, et al. Some facts about Charlier polynomials, 2011.

[9] Gregory Valiant, et al. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs, 2011, STOC.

[10] B. Ripley, et al. Pattern Recognition, 1968, Nature.

[11] Bernhard Schölkopf, et al. A Kernel Method for the Two-Sample-Problem, 2006, NIPS.

[12] Gregory Valiant, et al. Estimating the Unseen, 2017, Journal of the ACM.

[13] J. Hájek. Local asymptotic minimax and admissibility in estimation, 1972.

[14] C. E. Shannon. A Mathematical Theory of Communication, 1948, Bell System Technical Journal.

[15] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[16] Eli Upfal, et al. Probability and Computing: Randomized Algorithms and Probabilistic Analysis, 2005.

[17] Yuan Xu, et al. Approximation Theory and Harmonic Analysis on Spheres and Balls, 2013.

[18] A. Haar. Die Minkowskische Geometrie und die Annäherung an stetige Funktionen, 1917.

[19] Michael Grabchak, et al. Nonparametric Estimation of Kullback-Leibler Divergence, 2014, Neural Computation.

[20] Lucien Birgé. Approximation dans les espaces métriques et théorie de l'estimation, 1983.

[21] L. Le Cam. Asymptotic Methods in Statistical Decision Theory, 1986.

[22] Qing Wang, et al. Divergence estimation for multidimensional densities via k-nearest-neighbor distances, 2009.

[23] Yihong Wu, et al. Chebyshev polynomials, moment matching, and optimal estimation of the unseen, 2015, The Annals of Statistics.

[24] Yanjun Han, et al. Minimax estimation of the L1 distance, 2016, IEEE International Symposium on Information Theory (ISIT).

[25] J. Hájek. A characterization of limiting distributions of regular estimates, 1970.

[26] Abraham Wald. Statistical Decision Functions, 1951.

[27] Olivier Catoni. Statistical learning theory and stochastic optimization, 2004.

[28] Liam Paninski. Estimation of Entropy and Mutual Information, 2003, Neural Computation.

[29] G. A. Miller. Note on the bias of information estimates, 1955.

[30] W. H. Pun. Statistical Decision Theory, 2014.

[31] Yanjun Han, et al. Minimax Estimation of the L1 Distance, 2018, IEEE Transactions on Information Theory.

[32] Martin J. Wainwright, et al. Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization, 2008, IEEE Transactions on Information Theory.

[33] M. A. Qazi, et al. Some Coefficient Estimates for Polynomials on the Unit Interval, 2007.

[34] I. N. Sanov. On the probability of large deviations of random variables, 1958.

[35] Vilmos Totik. Polynomial Approximation on Polytopes, 2014.

[36] J. Picard, et al. Statistical learning theory and stochastic optimization: École d'été de probabilités de Saint-Flour XXXI - 2001, 2004.

[37] Gregory Valiant, et al. Learning Populations of Parameters, 2017, NIPS.

[38] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[39] T. J. Rivlin. The Chebyshev polynomials, 1974.

[40] James Zou, et al. Quantifying the unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects, 2015, bioRxiv.

[41] E. Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen, 1909.

[42] E. Cheney. Introduction to approximation theory, 1966.

[43] V. Totik, et al. Moduli of smoothness, 1987.

[44] John R. Rice. Tchebycheff approximation in several variables, 1963.

[45] Liam Paninski. Estimating entropy on m bins given fewer than m samples, 2004, IEEE Transactions on Information Theory.

[46] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation, 2008, Springer Series in Statistics.

[47] Gregory Valiant, et al. Spectrum Estimation from Samples, 2016, arXiv.

[48] B. Park, et al. Estimation of Kullback-Leibler Divergence by Local Likelihood, 2006.

[49] A. Nemirovski, et al. On estimation of the Lr norm of a regression function, 1999.

[50] Fernando Pérez-Cruz. Kullback-Leibler divergence estimation of continuous distributions, 2008, IEEE International Symposium on Information Theory.

[51] Himanshu Tyagi, et al. The Complexity of Estimating Rényi Entropy, 2015, SODA.

[52] Solomon Kullback. Information Theory and Statistics, 1960.

[53] A. Carlton. On the bias of information estimates, 1969.

[54] Yanjun Han, et al. Minimax Estimation of Functionals of Discrete Distributions, 2014, IEEE Transactions on Information Theory.

[55] Alon Orlitsky, et al. On Modeling Profiles Instead of Values, 2004, UAI.

[56] Yingbin Liang, et al. Estimation of KL divergence between large-alphabet distributions, 2016, IEEE International Symposium on Information Theory (ISIT).

[57] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[58] Yingbin Liang, et al. Estimation of KL Divergence: Optimal Minimax Rate, 2016, IEEE Transactions on Information Theory.

[59] Yihong Wu, et al. Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation, 2014, IEEE Transactions on Information Theory.

[60] Imre Csiszár, et al. Information Theory: Coding Theorems for Discrete Memoryless Systems, Second Edition, 2011.

[61] Alon Orlitsky, et al. A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions, 2017, ICML.

[62] J. Steele. An Efron-Stein inequality for nonsymmetric statistics, 1986.

[63] Qing Wang, et al. Divergence estimation of continuous distributions based on data-dependent partitions, 2005, IEEE Transactions on Information Theory.

[64] Dietrich Braess, et al. Bernstein polynomials and learning theory, 2004, Journal of Approximation Theory.

[65] R. A. Leibler, et al. On Information and Sufficiency, 1951.