Optimally Learning Populations of Parameters

Consider the following fundamental estimation problem: there are $n$ entities, each with an unknown parameter $p_i \in [0,1]$, and we observe $n$ independent random variables, $X_1,\ldots,X_n$, with $X_i \sim \mathrm{Binomial}(t, p_i)$. How accurately can one recover the ``histogram'' (i.e., the cumulative distribution function) of the $p_i$s? While the empirical estimates $X_i/t$ recover the histogram only to earth mover distance $\Theta(\frac{1}{\sqrt{t}})$ (equivalently, $\ell_1$ distance between the CDFs), we show that, provided $n$ is sufficiently large, we can achieve error $O(\frac{1}{t})$, which is information-theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics and sports analytics.
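To make the baseline concrete, the following minimal sketch simulates the setup and measures the earth mover distance between the true $p_i$s and the empirical estimates $X_i/t$. It is not the recovery algorithm of the paper, only the naive estimator the abstract compares against. The function names (`simulate`, `emd_equal_size`, `empirical_error`) and the choice of setting every $p_i = 1/2$ are illustrative assumptions, not taken from the source.

```python
import random

def simulate(ps, t, rng):
    """Draw X_i ~ Binomial(t, p_i) for each p_i as a sum of t Bernoulli trials."""
    return [sum(rng.random() < p for _ in range(t)) for p in ps]

def emd_equal_size(a, b):
    """Earth mover distance between two empirical distributions on the line
    that have the same number of equally weighted atoms. In one dimension this
    is the mean absolute difference of the sorted values, which also equals
    the l1 distance between the two CDFs."""
    if len(a) != len(b):
        raise ValueError("both samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def empirical_error(ps, t, seed=0):
    """EMD between the true parameters and the naive estimates X_i / t."""
    rng = random.Random(seed)
    xs = simulate(ps, t, rng)
    estimates = [x / t for x in xs]
    return emd_equal_size(ps, estimates)

if __name__ == "__main__":
    # Assumed test case: every entity has p_i = 1/2, so the true histogram
    # is a point mass and the sampling noise of X_i / t is fully visible.
    ps = [0.5] * 2000
    for t in (4, 16, 64):
        print(t, empirical_error(ps, t, seed=1))
```

Under this assumed point-mass population, quadrupling $t$ should roughly halve the printed error, consistent with the $\Theta(\frac{1}{\sqrt{t}})$ rate of the empirical estimates rather than the $O(\frac{1}{t})$ rate the paper achieves.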
