A Survey on Distribution Testing: Your Data is Big. But is it Blue?

The field of property testing originated in work on program checking, and has evolved into an established and very active research area. In this work, we survey the developments of one of its most recent and prolific offshoots, distribution testing. This subfield, at the junction of property testing and statistics, is concerned with studying properties of probability distributions. We cover the current status of distribution testing in several settings, starting with the traditional sampling model, where the algorithm obtains independent samples from the distribution. We then discuss more recent models, which either grant the testing algorithms more powerful types of queries, or evaluate their performance against that of an information-theoretically optimal “adversary” (for a given number of samples). In each setting, we describe the state of the art for a variety of testing problems. We hope this survey will serve as a self-contained introduction for those considering research in this field.

Research supported by NSF CCF-1115703 and NSF CCF-1319788.

ACM Classification: G.3, F.2.2
AMS Classification: 68Q32, 68W20, 68Q17, 68Q87
