Data Amplification: A Unified and Competitive Approach to Property Estimation

Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive property estimator that, for a wide class of properties and for all underlying distributions, uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribution-independent "amplification" of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with $n$ samples is even as good as that of the empirical estimator with $n\log n$ samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.
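
For orientation, the baseline that the abstract compares against, the empirical (plug-in) estimator, can be sketched in a few lines. This is only an illustration of that baseline, not the estimator proposed in the paper. The function names `empirical_distribution`, `empirical_entropy`, and `empirical_support_size` are hypothetical, and Shannon entropy and support size are used here merely as two familiar examples of distribution properties.

```python
import math
from collections import Counter


def empirical_distribution(samples):
    """Return the empirical frequency of each symbol observed in `samples`."""
    n = len(samples)
    counts = Counter(samples)
    return {symbol: count / n for symbol, count in counts.items()}


def empirical_entropy(samples):
    """Plug-in estimate of Shannon entropy (in nats).

    The property is evaluated on the empirical distribution as if it
    were the true one; this is the baseline, not the paper's estimator.
    """
    p_hat = empirical_distribution(samples)
    return -sum(p * math.log(p) for p in p_hat.values())


def empirical_support_size(samples):
    """Plug-in estimate of support size: the number of distinct symbols seen."""
    return len(set(samples))
```

With few samples, plug-in estimates of this kind tend to be biased, for example unseen symbols make the empirical support size an underestimate. That gap is what motivates estimators that reach the same accuracy with fewer samples.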
