Optimal Algorithms for Testing Closeness of Discrete Distributions

We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions $p$ and $q$ over an $n$-element set, we wish to distinguish whether $p = q$ or $p$ is at least $\varepsilon$-far from $q$, in either $\ell_1$ or $\ell_2$ distance. Batu et al. [BFR+00, BFR+13] gave the first sub-linear time algorithms for these problems, which matched the lower bounds of [Val11] up to a logarithmic factor in $n$ and a polynomial factor of $\varepsilon$. In this work, we present simple testers for both the $\ell_1$ and $\ell_2$ settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on $n$ and in the dependence on $\varepsilon$. For the $\ell_1$ testing problem, we establish that the sample complexity is $\Theta(\max\{n^{2/3}/\varepsilon^{4/3},\ n^{1/2}/\varepsilon^{2}\})$.
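To make the shape of the $\ell_1$ bound concrete, the following is a minimal Python sketch that evaluates the two terms inside the maximum and reports which one dominates. The function names `l1_closeness_rate` and `dominant_term` are hypothetical, chosen only for illustration, and the values ignore the constant factors hidden by $\Theta(\cdot)$, so they indicate growth rates rather than actual sample counts. Setting the two terms equal shows that they cross at $\varepsilon = n^{-1/4}$: the first term dominates for larger $\varepsilon$, the second for smaller.

```python
def l1_closeness_rate(n: int, eps: float) -> float:
    """Evaluate max(n^(2/3)/eps^(4/3), n^(1/2)/eps^2).

    Assumption: constant factors hidden by the Theta notation are ignored,
    so the result is a growth rate, not a usable sample count.
    """
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    first = n ** (2 / 3) / eps ** (4 / 3)
    second = n ** 0.5 / eps ** 2
    return max(first, second)


def dominant_term(n: int, eps: float) -> str:
    """Report which of the two terms attains the maximum (hypothetical helper)."""
    if n <= 0 or eps <= 0:
        raise ValueError("n and eps must be positive")
    first = n ** (2 / 3) / eps ** (4 / 3)
    second = n ** 0.5 / eps ** 2
    return "n^(2/3)/eps^(4/3)" if first >= second else "n^(1/2)/eps^2"


if __name__ == "__main__":
    n = 10 ** 6
    # The crossover point for this n is eps = n^(-1/4), roughly 0.0316.
    for eps in (0.1, 0.01):
        print(eps, l1_closeness_rate(n, eps), dominant_term(n, eps))
```

For example, with $n = 10^6$ the crossover sits near $\varepsilon \approx 0.0316$, so $\varepsilon = 0.1$ falls in the regime governed by the first term and $\varepsilon = 0.01$ in the regime governed by the second.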

[1] Ronitt Rubinfeld, et al. Testing Properties of Collections of Distributions, 2013, Theory Comput.

[2] Alon Orlitsky, et al. Competitive Closeness Testing, 2011, COLT.

[3] Gregory Valiant, et al. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs, 2011, STOC '11.

[4] Alon Orlitsky, et al. Competitive Classification and Closeness Testing, 2012, 25th Annual Conference on Learning Theory.

[5] Ronitt Rubinfeld. Taming big probability distributions, 2012, XRDS.

[6] Ronitt Rubinfeld, et al. Sublinear algorithms for testing monotone and unimodal distributions, 2004, STOC '04.

[7] Ronitt Rubinfeld, et al. Testing that distributions are close, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[8] Alexandr Andoni, et al. External Sampling, 2009, ICALP.

[9] Rocco A. Servedio, et al. Testing k-Modal Distributions: Optimal Algorithms via Reductions, 2011, SODA.

[10] Krzysztof Onak. Testing Distribution Identity Efficiently, 2009, ArXiv.

[11] Liam Paninski, et al. A Coincidence-Based Test for Uniformity Given Very Sparsely Sampled Discrete Data, 2008, IEEE Transactions on Information Theory.

[12] Ronitt Rubinfeld, et al. The complexity of approximating entropy, 2002, STOC '02.

[13] Ronitt Rubinfeld, et al. Testing random variables for independence and identity, 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[14] Ronitt Rubinfeld, et al. Sublinear Time Algorithms for Earth Mover’s Distance, 2009, Theory of Computing Systems.

[15] Gregory Valiant, et al. The Power of Linear Estimators, 2011, 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.

[16] Tugkan Batu. Testing Properties of Distributions, 2001.

[17] Ronitt Rubinfeld, et al. Approximating and testing k-histogram distributions in sub-linear time, 2012, PODS '12.

[18] Gregory Valiant, et al. Instance-by-instance optimal identity testing, 2013, Electron. Colloquium Comput. Complex.

[19] Paul Valiant. Testing symmetric properties of distributions, 2008, STOC '08.

[20] Seshadhri Comandur, et al. Testing Expansion in Bounded Degree Graphs, 2007, Electron. Colloquium Comput. Complex.

[21] Ronitt Rubinfeld, et al. Testing Closeness of Discrete Distributions, 2010, JACM.