Testing Shape Restrictions of Discrete Distributions

Abstract: We study the question of testing structured properties (classes) of discrete distributions. Specifically, given sample access to an arbitrary distribution $D$ over $[n]$ and a property $\mathcal{P}$, the goal is to distinguish between $D \in \mathcal{P}$ and $\ell_1(D, \mathcal{P}) > \varepsilon$. We develop a general algorithm for this question, which applies to a large range of "shape-constrained" properties, including monotone, log-concave, t-modal, piecewise-polynomial, and Poisson Binomial distributions. Moreover, for all cases considered, our algorithm has near-optimal sample complexity with regard to the domain size and is computationally efficient. For most of these classes, we provide the first non-trivial tester in the literature. In addition, we describe a generic method to prove lower bounds for this problem, and use it to show that our upper bounds are nearly tight. Finally, we extend some of our techniques to tolerant testing, deriving nearly tight upper and lower bounds for the corresponding questions.
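
For readers who want the testing task spelled out, one common way to formalize it is sketched below. The abstract itself does not give this: the symbol $D'$, the infimum form of the distance, and the success probability $2/3$ are assumptions taken from the usual conventions of distribution testing.

$$
\ell_1(D, \mathcal{P}) \;=\; \inf_{D' \in \mathcal{P}} \lVert D - D' \rVert_1 \;=\; \inf_{D' \in \mathcal{P}} \sum_{i=1}^{n} \lvert D(i) - D'(i) \rvert
$$

$$
D \in \mathcal{P} \;\Rightarrow\; \Pr[\text{tester accepts}] \ge 2/3, \qquad \ell_1(D, \mathcal{P}) > \varepsilon \;\Rightarrow\; \Pr[\text{tester rejects}] \ge 2/3
$$

Under this reading, a tester may answer arbitrarily when $D \notin \mathcal{P}$ but $\ell_1(D, \mathcal{P}) \le \varepsilon$. Tolerant testing, mentioned at the end of the abstract, is the variant that also requires acceptance when $D$ is merely close to $\mathcal{P}$.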
