Maximum Likelihood Estimation for Learning Populations of Parameters

Consider a setting with $N$ independent individuals, each with an unknown parameter $p_i \in [0, 1]$ drawn from some unknown distribution $P^\star$. For each individual we observe the outcomes of $t$ independent Bernoulli trials, i.e., $X_i \sim \text{Binomial}(t, p_i)$, and our objective is to accurately estimate $P^\star$. This problem arises in numerous domains, including the social sciences, psychology, healthcare, and biology, where the population under study is usually large while the number of observations per individual is often limited. Our main result shows that, in the regime where $t \ll N$, the maximum likelihood estimator (MLE) is both statistically minimax optimal and efficiently computable. Precisely, for sufficiently large $N$ and $t < c\log{N}$, the MLE achieves the information-theoretically optimal error bound of $\mathcal{O}(\frac{1}{t})$, with respect to the earth mover's distance between the estimated and true distributions. More generally, in an exponentially large interval of $t$ beyond $c \log{N}$, the MLE achieves the minimax error bound of $\mathcal{O}(\frac{1}{\sqrt{t\log N}})$. In contrast, regardless of how large $N$ is, the naive "plug-in" estimator, i.e., the empirical distribution of the fractions $X_i/t$, achieves only the sub-optimal error of $\Theta(\frac{1}{\sqrt{t}})$.
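The gap between the two estimators can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the algorithm analyzed in the paper: it approximates the MLE by restricting the mixing distribution to a fixed grid on $[0, 1]$ and running a standard EM update on the grid weights. The function names (`binomial_pmf_matrix`, `grid_mle`, `emd_1d`, `demo`), the grid size, the iteration count, the seed, and the two-atom choice of $P^\star$ are all hypothetical and chosen for illustration.

```python
from math import comb

import numpy as np


def binomial_pmf_matrix(t, grid):
    """A[s, j] = P(X = s | p = grid[j]) for s = 0..t."""
    s = np.arange(t + 1)[:, None]
    coeff = np.array([comb(t, k) for k in range(t + 1)], dtype=float)[:, None]
    p = grid[None, :]
    return coeff * p ** s * (1.0 - p) ** (t - s)


def grid_mle(counts, t, m=101, iters=2000):
    """Approximate MLE of the mixing distribution on a fixed grid (assumption),
    fitted with EM updates on the grid weights."""
    grid = np.linspace(0.0, 1.0, m)
    A = binomial_pmf_matrix(t, grid)
    # Empirical histogram of the observed counts X_i.
    h = np.bincount(counts, minlength=t + 1) / len(counts)
    w = np.full(m, 1.0 / m)
    for _ in range(iters):
        mix = np.maximum(A @ w, 1e-300)  # model pmf of X under current weights
        w = w * (A.T @ (h / mix))        # multiplicative EM step
        w = w / w.sum()                  # guard against floating-point drift
    return grid, w


def emd_1d(x, wx, y, wy):
    """Earth mover's distance on the line: integral of |F_x - F_y|."""
    pts = np.concatenate([x, y])
    order = np.argsort(pts)
    signed_mass = np.concatenate([wx, -wy])[order]
    cdf_diff = np.cumsum(signed_mass)[:-1]
    return float(np.sum(np.abs(cdf_diff) * np.diff(pts[order])))


def demo(N=20000, t=10, seed=0):
    """Return (EMD of plug-in estimate, EMD of grid MLE) for a toy P*."""
    rng = np.random.default_rng(seed)
    atoms = np.array([0.25, 0.75])   # illustrative P*: two equally likely atoms
    probs = np.array([0.5, 0.5])
    p = rng.choice(atoms, size=N, p=probs)
    counts = rng.binomial(t, p)      # X_i ~ Binomial(t, p_i)
    plug_in_points = counts / t      # plug-in estimate: empirical law of X_i / t
    plug_in_weights = np.full(N, 1.0 / N)
    grid, w = grid_mle(counts, t)
    return (
        emd_1d(plug_in_points, plug_in_weights, atoms, probs),
        emd_1d(grid, w, atoms, probs),
    )


if __name__ == "__main__":
    emd_plug_in, emd_mle = demo()
    print(f"plug-in EMD: {emd_plug_in:.4f}, grid-MLE EMD: {emd_mle:.4f}")
```

In this toy setup the plug-in estimate inherits the binomial sampling noise of each $X_i/t$, whereas the grid-restricted MLE deconvolves that noise, so its earth mover's distance to $P^\star$ should come out smaller. The grid restriction and the fixed number of EM iterations are simplifications; a finer grid or a different convex solver could be substituted.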
