A General Framework for Symmetric Property Estimation

In this paper we provide a general framework for estimating symmetric properties of distributions from i.i.d. samples. For a broad class of symmetric properties, we identify the \emph{easy} region, where empirical estimation works, and the \emph{difficult} region, where more complex estimators are required. We show that approximately computing the profile maximum likelihood (PML) distribution \cite{ADOS16} in this difficult region yields a symmetric property estimation framework that is sample-complexity optimal for many properties, in a broader parameter regime than previous universal estimation approaches based on PML. The resulting algorithms, based on these \emph{pseudo PML distributions}, are also more practical.
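
As a rough illustration of the objects involved, the sketch below computes the profile of a sample (how many distinct symbols appear exactly $j$ times) and splits the observed symbols into a high-frequency and a low-frequency group. The function names, the fixed count `threshold`, and the use of entropy as the example property are assumptions made here for illustration; the abstract does not specify how the easy and difficult regions are defined, and the pseudo PML computation for the difficult region is not reproduced.

```python
from collections import Counter
import math


def profile(sample):
    """Map each frequency j to the number of distinct symbols seen exactly j times."""
    counts = Counter(sample)
    return dict(Counter(counts.values()))


def split_by_frequency(sample, threshold):
    """Split symbol counts into two groups using a hypothetical count threshold.

    Symbols seen more than `threshold` times go to the 'easy' group;
    the remaining observed symbols go to the 'difficult' group.
    """
    counts = Counter(sample)
    easy = {x: c for x, c in counts.items() if c > threshold}
    difficult = {x: c for x, c in counts.items() if c <= threshold}
    return easy, difficult


def empirical_entropy_part(counts, n):
    """Plug-in (empirical) entropy contribution, in nats, of the given symbols."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


sample = list("aaaaaaaabbbbbbcd")
n = len(sample)

phi = profile(sample)
easy, difficult = split_by_frequency(sample, threshold=2)
easy_part = empirical_entropy_part(easy, n)
```

The profile depends only on the multiset of counts, so relabeling the symbols leaves it unchanged; this is why it is a natural statistic for symmetric properties. In this sketch only the easy group is handled by the plug-in estimate, and the difficult group is left for a more careful estimator.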

[1]  Yanjun Han,et al.  Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance , 2018, COLT.

[2]  T. Cai,et al.  Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional , 2011, 1105.3039.

[3]  Yanjun Han,et al.  Minimax Estimation of the $L_{1}$ Distance , 2018, IEEE Transactions on Information Theory.

[4]  Yanjun Han,et al.  Minimax Estimation of Functionals of Discrete Distributions , 2014, IEEE Transactions on Information Theory.

[5]  Dana Ron,et al.  Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[6]  Alon Orlitsky,et al.  Algorithms for modeling distributions over large alphabets , 2004, International Symposium on Information Theory, 2004. ISIT 2004. Proceedings.

[7]  Himanshu Tyagi,et al.  The Complexity of Estimating Rényi Entropy , 2015, SODA.

[8]  Tsachy Weissman,et al.  Approximate Profile Maximum Likelihood , 2017, J. Mach. Learn. Res.

[9]  Alon Orlitsky,et al.  A Unified Maximum Likelihood Approach for Optimal Distribution Property Estimation , 2016, Electron. Colloquium Comput. Complex.

[10]  Yanjun Han,et al.  On Estimation of $L_{r}$-Norms in Gaussian White Noise Models , 2017, Probability Theory and Related Fields.

[11]  Gregory Valiant,et al.  The Power of Linear Estimators , 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[12]  Gregory Valiant,et al.  Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs , 2011, STOC '11.

[13]  Alon Orlitsky,et al.  The Broad Optimality of Profile Maximum Likelihood , 2019, NeurIPS.

[14]  A. Timan.  Theory of Approximation of Functions of a Real Variable , 1994.

[15]  Alon Orlitsky,et al.  Data Amplification: A Unified and Competitive Approach to Property Estimation , 2019, NeurIPS.

[16]  Yihong Wu,et al.  Sample complexity of the distinct elements problem , 2016, 1612.03375.

[17]  A. Nemirovski.  On tractable approximations of randomly perturbed convex constraints , 2003, 42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475).

[18]  P. McCullagh.  Estimating the Number of Unseen Species: How Many Words did Shakespeare Know? , 2008.

[19]  Yihong Wu,et al.  Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation , 2014, IEEE Transactions on Information Theory.

[20]  Alon Orlitsky,et al.  Data Amplification: Instance-Optimal Property Estimation , 2019, ICML.

[21]  Yanjun Han,et al.  Optimal rates of entropy estimation over Lipschitz balls , 2017, The Annals of Statistics.

[22]  B. Efron,et al.  Estimating the number of unseen species: How many words did Shakespeare know? , 1976, Biometrika 63.

[23]  Yihong Wu,et al.  Chebyshev polynomials, moment matching, and optimal estimation of the unseen , 2015, The Annals of Statistics.

[24]  Yanjun Han,et al.  Minimax estimation of the L1 distance , 2016, 2016 IEEE International Symposium on Information Theory (ISIT).

[25]  Moses Charikar,et al.  Efficient profile maximum likelihood for universal symmetric property estimation , 2019, STOC.

[26]  James Zou,et al.  Quantifying the unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects , 2015, bioRxiv.

[27]  James Zou,et al.  Estimating the unseen from multiple populations , 2017, ICML.

[28]  A. Suresh,et al.  Optimal prediction of the number of unseen species , 2016, Proceedings of the National Academy of Sciences.