The Broad Optimality of Profile Maximum Likelihood

We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size $k$ and desired accuracy $\varepsilon$:

$\textbf{Distribution estimation.}$ Under $\ell_1$ distance, PML achieves the optimal $\Theta(k/(\varepsilon^2\log k))$ sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual (unsorted) distribution.

$\textbf{Additive property estimation.}$ For a broad class of additive properties, the PML plug-in estimator uses only four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence.

$\boldsymbol{\alpha}\textbf{-R\'enyi entropy estimation.}$ For integer $\alpha>1$, the PML plug-in estimator achieves the optimal $k^{1-1/\alpha}$ sample complexity; for non-integer $\alpha>3/4$, its sample complexity improves on the state of the art.

$\textbf{Identity testing.}$ In testing whether an unknown distribution equals, or is at least $\varepsilon$-far in $\ell_1$ distance from, a given distribution, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of $k$.

With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.
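To make the central object concrete, the sketch below (Python, standard library only) computes a sample's profile, evaluates the profile likelihood by brute-force enumeration, and approximates the PML distribution by a grid search before plugging it into an entropy estimate. This is only an illustration of the definitions, not the paper's estimator: the helper names (`profile`, `profile_probability`, `approx_pml`), the grid resolution, and the toy sample are assumptions made here for exposition, the exhaustive enumeration is feasible only for tiny $n$ and $k$, and the grid search stands in for the efficient PML approximation algorithms the abstract alludes to.

```python
# A minimal brute-force sketch of the PML definition and the plug-in idea on a
# toy example; it is NOT the paper's estimator (which relies on efficient PML
# approximations) and is feasible only for tiny sample size n and support size k.
from collections import Counter
from itertools import product
import math


def profile(sample):
    """Profile of a sample: the multiset of symbol multiplicities, ignoring labels."""
    return tuple(sorted(Counter(sample).values()))


def profile_probability(phi, p, n):
    """P(profile = phi | distribution p), by enumerating all length-n sequences
    over p's support and summing the probabilities of those matching phi."""
    total = 0.0
    for seq in product(range(len(p)), repeat=n):
        if profile(seq) == phi:
            prob = 1.0
            for symbol in seq:
                prob *= p[symbol]
            total += prob
    return total


def approx_pml(phi, n, k, grid_steps=20):
    """Approximate the PML distribution by grid search over distributions
    on a size-k support (an illustrative stand-in for a real PML solver)."""
    best_p, best_val = None, -1.0
    for cuts in product(range(grid_steps + 1), repeat=k - 1):
        if sum(cuts) > grid_steps:
            continue
        masses = list(cuts) + [grid_steps - sum(cuts)]
        p = [m / grid_steps for m in masses]
        val = profile_probability(phi, p, n)
        if val > best_val:
            best_p, best_val = p, val
    return best_p, best_val


def entropy(p):
    """Shannon entropy (in nats) of a distribution given as a list of probabilities."""
    return -sum(x * math.log(x) for x in p if x > 0)


sample = ["a", "b", "a", "c"]                 # n = 4, profile (1, 1, 2)
phi = profile(sample)
p_pml, likelihood = approx_pml(phi, n=len(sample), k=3)
print("profile:", phi)
print("approximate PML distribution:", p_pml)
print("profile likelihood:", round(likelihood, 4))
print("plug-in entropy estimate:", round(entropy(p_pml), 4))
```

The plug-in step at the end mirrors the approach described above: once a (here, crudely approximated) PML distribution is in hand, any symmetric property such as Shannon or Rényi entropy is estimated simply by evaluating it on that distribution.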
