Learning Populations of Parameters

Consider the following estimation problem: there are $n$ entities, each with an unknown parameter $p_i \in [0,1]$, and we observe $n$ independent random variables, $X_1,\ldots,X_n$, with $X_i \sim \mathrm{Binomial}(t, p_i)$. How accurately can one recover the "histogram" (i.e., the cumulative distribution function) of the $p_i$'s? While the empirical estimates $X_i/t$ would recover the histogram to earth mover distance $\Theta(\frac{1}{\sqrt{t}})$ (equivalently, $\ell_1$ distance between the CDFs), we show that, provided $n$ is sufficiently large, we can achieve error $O(\frac{1}{t})$, which is information-theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics, sports analytics, and variation in the gender ratio of offspring.
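To make the $\Theta(\frac{1}{\sqrt{t}})$ baseline concrete, the following is a minimal simulation sketch of the naive empirical estimator only; it is not the recovery algorithm of the paper. It assumes a hypothetical population in which every $p_i = 0.5$, a case where the empirical estimates spread out by roughly $1/\sqrt{t}$. The function names, the choice of $n = 2000$, and the values of $t$ are illustrative assumptions.

```python
import random


def earth_mover_distance(a, b):
    """EMD between two equal-size point sets on the line, each point weighted 1/n.

    In one dimension this equals the l1 distance between the two CDFs, and it is
    obtained by matching the sorted points in order.
    """
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def empirical_estimate_error(p, t, rng):
    """Draw X_i ~ Binomial(t, p_i) and return the EMD between {X_i / t} and {p_i}."""
    estimates = []
    for p_i in p:
        # A Binomial(t, p_i) draw as a sum of t independent Bernoulli(p_i) trials.
        x_i = sum(rng.random() < p_i for _ in range(t))
        estimates.append(x_i / t)
    return earth_mover_distance(estimates, p)


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 2000                 # assumed population size (illustrative)
p = [0.5] * n            # assumed population: every parameter equals 0.5

errors = {t: empirical_estimate_error(p, t, rng) for t in (10, 40, 160)}
for t, err in errors.items():
    # If the error scales like 1/sqrt(t), err * sqrt(t) stays roughly constant.
    print(t, round(err, 4), round(err * t ** 0.5, 3))
```

Under these assumptions, quadrupling $t$ should roughly halve the printed error, consistent with the $1/\sqrt{t}$ rate of the empirical estimates described above.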
