Bessel Smoothing and Multi-Distribution Property Estimation

We consider a basic problem in statistical learning: estimating properties of multiple discrete distributions. Denoting by ∆k the standard simplex over [k] := {0, 1, . . . , k}, a property of d distributions is a mapping from (∆k)^d to R. Such properties include well-known distribution characteristics such as Shannon entropy and support size (d = 1), and many important divergence measures between distributions (d = 2). The primary problem is to learn the property value of an unknown d-tuple of distributions from samples. The study of such problems dates back to the works of Good (1953); Carlton (1969); Efron and Thisted (1976); Thisted and Efron (1987), and has advanced steadily over the past decades. Surprisingly, before our work, the general landscape of this fundamental learning problem was insufficiently understood, and nearly all existing results are for the special case d ≤ 2. Our first main result provides a near-linear-time computable algorithm that, given independent samples from any collection of distributions and for a broad class of multi-distribution properties, learns the property as well as the empirical plug-in estimator that uses samples larger by a logarithmic factor. As a corollary, for any ε > 0 and fixed d ∈ Z+, a d-distribution property over [k] that is Lipschitz and additively separable can be learned to an accuracy of ε using a sample of size O(k/(ε √(log k))), with high probability. Our second result addresses a closely related problem, tolerant independence testing: one receives samples from the unknown joint and marginal distributions, and attempts to infer the ℓ1 distance between the joint distribution and the product distribution of the marginals. We show that this testing problem also admits a sample complexity sublinear in the alphabet sizes, demonstrating the broad applicability of our approach.
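
For orientation, the empirical plug-in baseline mentioned in the abstract can be sketched in a few lines of Python. This is only the naive reference estimator, not the Bessel-smoothing algorithm of the paper. The function names (`empirical`, `plugin_separable`, `plugin_l1_independence`) are hypothetical and chosen for illustration. The sketch assumes an additively separable property of the form sum over x of f(p_1(x), ..., p_d(x)) with f(0, ..., 0) = 0, and for the independence example it simplifies the setting by computing the marginals from the same paired sample.

```python
from collections import Counter
from itertools import product


def empirical(samples):
    """Return the empirical distribution of a list of hashable symbols."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}


def plugin_separable(f, *sample_sets):
    """Plug-in estimate of sum_x f(p_1(x), ..., p_d(x)).

    Assumption: f(0, ..., 0) == 0, so unobserved symbols contribute nothing.
    """
    dists = [empirical(s) for s in sample_sets]
    support = set().union(*dists)  # symbols seen in at least one sample
    return sum(f(*(p.get(x, 0.0) for p in dists)) for x in support)


def plugin_l1_independence(pairs):
    """Plug-in estimate of the l1 distance between the joint distribution
    and the product of its marginals.

    Simplification: the marginals are derived from the paired sample itself.
    """
    joint = empirical(pairs)
    px = empirical([a for a, _ in pairs])
    py = empirical([b for _, b in pairs])
    return sum(
        abs(joint.get((a, b), 0.0) - px[a] * py[b]) for a, b in product(px, py)
    )


# d = 2 example: l1 distance between two empirical distributions.
l1 = plugin_separable(lambda p, q: abs(p - q), [0, 0, 1, 1], [0, 1, 1, 1])
# Perfectly correlated pairs lie far from the product of their marginals.
dep = plugin_l1_independence([(0, 0), (1, 1)])
```

The abstract's first result says, roughly, that the proposed estimator performs as well as such a plug-in estimator would if the plug-in were given a sample larger by a logarithmic factor.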

[1]  Guy Bresler,et al.  Efficiently Learning Ising Models on Arbitrary Graphs , 2014, STOC.

[2]  Andrew Zisserman,et al.  Efficient additive kernels via explicit feature maps , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[3]  Anne Chao,et al.  Species Richness: Estimation and Comparison , 2016 .

[4]  Yanjun Han,et al.  Minimax rate-optimal estimation of KL divergence between discrete distributions , 2016, 2016 International Symposium on Information Theory and Its Applications (ISITA).

[5]  A. Chao Nonparametric estimation of the number of classes in a population , 1984 .

[6]  Alon Orlitsky,et al.  The Broad Optimality of Profile Maximum Likelihood , 2019, NeurIPS.

[7]  B. Efron,et al.  Estimating the number of unseen species: How many words did Shakespeare know? , 1976, Biometrika.

[8]  B. Lindsay,et al.  Estimating the number of classes , 2007, 0708.2153.

[9]  L. Milne‐Thomson A Treatise on the Theory of Bessel Functions , 1945, Nature.

[10]  G D Lewen,et al.  Reproducibility and Variability in Neural Spike Trains , 1997, Science.

[11]  Ronitt Rubinfeld,et al.  Testing Mixtures of Discrete Distributions , 2019, COLT.

[12]  James Zou,et al.  Estimating the unseen from multiple populations , 2017, ICML.

[13]  Jorge Bustamante,et al.  Bernstein Operators and Their Properties , 2017 .

[14]  Alon Orlitsky,et al.  Data Amplification: Instance-Optimal Property Estimation , 2019, ICML.

[15]  Gregory Valiant,et al.  The Power of Linear Estimators , 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[16]  Alon Orlitsky,et al.  A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions , 2017, ICML.

[17]  Alon Orlitsky,et al.  Data Amplification: A Unified and Competitive Approach to Property Estimation , 2019, NeurIPS.

[18]  Colin McDiarmid,et al.  On the method of bounded differences , 1989, Surveys in Combinatorics.

[19]  Ping Li Generalized Intersection Kernel , 2016, ArXiv.

[20]  Ilias Diakonikolas,et al.  Optimal Algorithms for Testing Closeness of Discrete Distributions , 2013, SODA.

[21]  Flemming Topsøe,et al.  Some inequalities for information divergence and related measures of discrimination , 2000, IEEE Trans. Inf. Theory.

[22]  Yihong Wu,et al.  Chebyshev polynomials, moment matching, and optimal estimation of the unseen , 2015, The Annals of Statistics.

[23]  Noga Alon,et al.  Testing k-wise and almost k-wise independence , 2007, STOC '07.

[24]  Ping Li,et al.  Approximating Higher-Order Distances Using Random Projections , 2010, UAI.

[25]  I. Ionita-Laza,et al.  Estimating the number of unseen variants in the human genome , 2009, Proceedings of the National Academy of Sciences.

[26]  A. Chao,et al.  Estimating the Number of Classes via Sample Coverage , 1992 .

[27]  Ronitt Rubinfeld,et al.  Testing that distributions are close , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[28]  Ping Li,et al.  Computationally Efficient Estimators for Dimension Reductions Using Stable Random Projections , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[29]  Dana Ron,et al.  On Testing Expansion in Bounded-Degree Graphs , 2000, Studies in Complexity and Cryptography.

[30]  Paul Valiant,et al.  Estimating the Unseen , 2013, NIPS.

[31]  Ronitt Rubinfeld,et al.  Testing random variables for independence and identity , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[32]  Subhransu Maji,et al.  Classification using intersection kernel support vector machines is efficient , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  T. Cai,et al.  Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional , 2011, 1105.3039.

[34]  A. Carlton On the bias of information estimates. , 1969 .

[35]  Piotr Indyk,et al.  Stable distributions, pseudorandom generators, embeddings, and data stream computation , 2006, JACM.

[36]  Seshadhri Comandur,et al.  Testing Expansion in Bounded Degree Graphs , 2007, Electron. Colloquium Comput. Complex..

[37]  T. Tony Cai,et al.  Nonquadratic estimators of a quadratic functional , 2005 .

[38]  C. N. Liu,et al.  Approximating discrete probability distributions with dependence trees , 1968, IEEE Trans. Inf. Theory.

[39]  Ping Li,et al.  A New Algorithm for Compressed Counting with Applications in Shannon Entropy Estimation in Dynamic Data , 2011, COLT.

[40]  Alon Orlitsky,et al.  On Modeling Profiles Instead of Values , 2004, UAI.

[41]  Kenneth Ward Church,et al.  One sketch for all: Theory and Application of Conditional Random Sampling , 2008, NIPS.

[42]  Alon Orlitsky,et al.  Profile Entropy: A Fundamental Measure for the Learnability and Compressibility of Discrete Distributions , 2020, ArXiv.

[43]  O. Szász Generalization of S. Bernstein's Polynomials to the Infinite Interval , 1950 .

[44]  Alon Orlitsky,et al.  Unified Sample-Optimal Property Estimation in Near-Linear Time , 2019, NeurIPS.

[45]  Derek Hoiem,et al.  Building text features for object image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Jeffrey F. Naughton,et al.  Sampling-Based Estimation of the Number of Distinct Values of an Attribute , 1995, VLDB.

[47]  D. Mcneil Estimating an Author's Vocabulary , 1973 .

[48]  Clément L. Canonne,et al.  A Survey on Distribution Testing: Your Data is Big. But is it Blue? , 2020, Electron. Colloquium Comput. Complex..

[49]  Yanjun Han,et al.  Minimax estimation of the L1 distance , 2016, 2016 IEEE International Symposium on Information Theory (ISIT).

[50]  T. Sejnowski,et al.  Reliability of spike timing in neocortical neurons. , 1995, Science.

[51]  B. Efron,et al.  Did Shakespeare write a newly-discovered poem? , 1987 .

[52]  Ronitt Rubinfeld,et al.  Testing Non-uniform k-Wise Independent Distributions over Product Spaces , 2010, ICALP.

[53]  Himanshu Tyagi,et al.  Estimating Renyi Entropy of Discrete Distributions , 2014, IEEE Transactions on Information Theory.

[54]  Yanjun Han,et al.  Minimax Estimation of Functionals of Discrete Distributions , 2014, IEEE Transactions on Information Theory.

[55]  A. Chao Species Estimation and Applications , 2006 .

[56]  Kerstin Vogler,et al.  Table Of Integrals Series And Products , 2016 .

[57]  Francesca Odone,et al.  Histogram intersection kernel for image classification , 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[58]  Bingqing Li,et al.  A Class of New Metrics Based on Triangular Discrimination , 2015, Inf..

[59]  Yihong Wu,et al.  Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation , 2014, IEEE Transactions on Information Theory.

[60]  Yingbin Liang,et al.  Estimation of KL Divergence: Optimal Minimax Rate , 2016, IEEE Transactions on Information Theory.

[61]  Gregory Valiant,et al.  Instance optimal learning of discrete distributions , 2016, STOC.

[62]  A. Suresh,et al.  Optimal prediction of the number of unseen species , 2016, Proceedings of the National Academy of Sciences.

[63]  F. Chung,et al.  Complex Graphs and Networks , 2006 .

[64]  Jayadev Acharya,et al.  Profile Maximum Likelihood is Optimal for Estimating KL Divergence , 2018, 2018 IEEE International Symposium on Information Theory (ISIT).

[65]  David M. Blei,et al.  Variational Inference: A Review for Statisticians , 2016, ArXiv.

[66]  Gennady Samorodnitsky,et al.  Sign Cauchy Projections and Chi-Square Kernel , 2013, NIPS.

[67]  Daniel M. Kane,et al.  A New Approach for Testing Properties of Discrete Distributions , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[68]  Liam Paninski,et al.  A Coincidence-Based Test for Uniformity Given Very Sparsely Sampled Discrete Data , 2008, IEEE Transactions on Information Theory.

[69]  Yanjun Han,et al.  Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance , 2018, COLT.

[70]  Liam Paninski,et al.  Estimation of Entropy and Mutual Information , 2003, Neural Computation.

[71]  Todd P. Coleman,et al.  Efficient Methods to Compute Optimal Tree Approximations of Directed Information Graphs , 2013, IEEE Transactions on Signal Processing.

[72]  Ronitt Rubinfeld,et al.  The complexity of approximating entropy , 2002, STOC '02.

[73]  D. Relman,et al.  Bacterial diversity within the human subgingival crevice. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[74]  I. J. Taneja Bounds On Triangular Discrimination, Harmonic Mean and Symmetric Chi-square Divergences , 2005, math/0505238.

[75]  Gregory Valiant,et al.  Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs , 2011, STOC '11.

[76]  Richard Szeliski,et al.  Computer Vision - Algorithms and Applications , 2011, Texts in Computer Science.

[77]  Nozha Boujemaa,et al.  Generalized histogram intersection kernel for image recognition , 2005, IEEE International Conference on Image Processing 2005.

[78]  Trevor Hastie,et al.  A Unified Near-Optimal Estimator For Dimension Reduction in l_α(0 , 2007, NIPS 2007.

[79]  Moses Charikar,et al.  Efficient profile maximum likelihood for universal symmetric property estimation , 2019, STOC.

[80]  Wulfram Gerstner,et al.  Spiking Neuron Models: Single Neurons, Populations, Plasticity , 2002 .

[81]  P. McCullagh Estimating the Number of Unseen Species: How Many Words did Shakespeare Know? , 2008 .

[82]  Ping Li,et al.  Very sparse stable random projections for dimension reduction in lα (0 <α ≤ 2) norm , 2007, KDD '07.

[83]  Ronitt Rubinfeld,et al.  Testing Properties of Collections of Distributions , 2013, Theory Comput..

[84]  Robert K. Colwell,et al.  Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages , 2012 .

[85]  I. Good The Population Frequencies of Species and the Estimation of Population Parameters , 1953 .

[86]  Constantinos Daskalakis,et al.  Optimal Testing for Properties of Distributions , 2015, NIPS.

[87]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .