Paper information - Learning Populations of Parameters

Learning Populations of Parameters

Consider the following estimation problem: there are $n$ entities, each with an unknown parameter $p_i \in [0,1]$, and we observe $n$ independent random variables, $X_1,\ldots,X_n$, with $X_i \sim \mathrm{Binomial}(t, p_i)$. How accurately can one recover the "histogram" (i.e., the cumulative distribution function) of the $p_i$'s? While the empirical estimates would recover the histogram to earth mover distance $\Theta(\frac{1}{\sqrt{t}})$ (equivalently, $\ell_1$ distance between the CDFs), we show that, provided $n$ is sufficiently large, we can achieve error $O(\frac{1}{t})$, which is information-theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics, sports analytics, and variation in the gender ratio of offspring.
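
To make the $\Theta(\frac{1}{\sqrt{t}})$ baseline concrete, the following is a minimal simulation sketch of the empirical estimates $X_i/t$ only; it is not the recovery algorithm described in the abstract. It assumes a hypothetical population in which every $p_i = 0.5$, chosen for illustration because it makes the empirical histogram's error easy to see, and the helper names `emd_1d` and `empirical_emd` are our own. For two equal-size samples on the line, the earth mover distance equals the mean absolute difference of the sorted values.

```python
import numpy as np

def emd_1d(a, b):
    """Earth mover (Wasserstein-1) distance between two equal-size samples on the line."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("samples must have the same size")
    # For equal-size samples, the optimal transport plan matches sorted values.
    return float(np.mean(np.abs(a - b)))

def empirical_emd(n, t, rng):
    """EMD between the true p_i's and the empirical estimates X_i / t."""
    # Assumption for illustration: every entity has p_i = 0.5 (a point-mass population).
    p = np.full(n, 0.5)
    x = rng.binomial(t, p)  # X_i ~ Binomial(t, p_i), drawn independently
    return emd_1d(p, x / t)

rng = np.random.default_rng(0)
errors = {t: empirical_emd(20_000, t, rng) for t in (4, 16, 64)}
for t, err in errors.items():
    # If the error scales like 1/sqrt(t), err * sqrt(t) stays roughly constant.
    print(f"t={t:3d}  EMD={err:.4f}  EMD*sqrt(t)={err * np.sqrt(t):.3f}")
```

In this setup the product `err * sqrt(t)` should stay roughly constant as $t$ grows, which is the $\frac{1}{\sqrt{t}}$ scaling that the abstract's $O(\frac{1}{t})$ guarantee improves on when $n$ is sufficiently large.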
