Approximate Profile Maximum Likelihood

We propose an efficient algorithm for approximate computation of the profile maximum likelihood (PML), a variant of maximum likelihood that maximizes the probability of observing a sufficient statistic rather than the empirical sample. The PML has appealing theoretical properties but is difficult to compute exactly. Inspired by observations gleaned from exactly solvable cases, we look for an approximate PML solution that, intuitively, clumps comparably frequent symbols into one symbol. This amounts to lower-bounding a certain matrix permanent by summing over a subgroup of the symmetric group rather than the whole group during the computation. We experiment extensively with the approximate solution and find that its empirical performance is competitive with, and sometimes significantly better than, state-of-the-art methods on various estimation problems.
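To make the subgroup-summation idea concrete, here is a minimal sketch, not the paper's implementation, of how restricting the sum over permutations to a block-diagonal subgroup lower-bounds a nonnegative matrix permanent. The function names (exact_permanent, block_permanent_lower_bound) and the particular block partition are illustrative assumptions; in the PML setting each block would correspond to a clump of comparably frequent symbols, but the code below only demonstrates the generic bound.

```python
# Illustrative sketch (not the paper's code): lower-bounding the permanent of a
# nonnegative matrix by summing only over block-diagonal permutations, i.e.
# over the subgroup S_{k_1} x ... x S_{k_m} embedded in S_n.
from itertools import permutations
from math import prod

import numpy as np


def exact_permanent(A: np.ndarray) -> float:
    """Brute-force permanent: sum over all of S_n (feasible only for tiny n)."""
    n = A.shape[0]
    return sum(prod(A[i, sigma[i]] for i in range(n))
               for sigma in permutations(range(n)))


def block_permanent_lower_bound(A: np.ndarray, blocks: list[list[int]]) -> float:
    """Sum over permutations that map each index block onto itself.

    For a nonnegative matrix this keeps only a subset of the nonnegative terms
    in the full permanent sum, so the result is a lower bound on perm(A). The
    restricted sum factorizes into a product of permanents of diagonal blocks.
    """
    bound = 1.0
    for block in blocks:
        sub = A[np.ix_(block, block)]   # diagonal sub-block for this clump
        bound *= exact_permanent(sub)   # permanent of the sub-block
    return bound


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((6, 6))              # nonnegative test matrix
    blocks = [[0, 1, 2], [3, 4, 5]]     # hypothetical partition into two clumps
    print(block_permanent_lower_bound(A, blocks), "<=", exact_permanent(A))
```

The design point this is meant to convey: the restricted sum is exponentially cheaper (a product of small permanents rather than one large one), and it never overshoots the true permanent because every discarded term is nonnegative.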
