Telling Two Distributions Apart: a Tight Characterization

We consider the problem of distinguishing between two arbitrary black-box distributions defined over the domain [n], given access to $s$ samples from both. It is known that in the worst case O(n^{2/3}) samples is both necessary and sufficient, provided that the distributions have L1 difference of at least {\epsilon}. However, it is also known that in many cases fewer samples suffice. We identify a new parameter, that provides an upper bound on how many samples needed, and present an efficient algorithm that requires the number of samples independent of the domain size. Also for a large subclass of distributions we provide a lower bound, that matches our upper bound up to a poly-logarithmic factor.

[1]  Dana Ron,et al.  Property testing and its connection to learning and approximation , 1998, JACM.

[2]  Ronitt Rubinfeld,et al.  Testing that distributions are close , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[3]  Ronitt Rubinfeld,et al.  Sublinear algorithms for testing monotone and unimodal distributions , 2004, STOC '04.

[4]  Paul Valiant Testing symmetric properties of distributions , 2008, STOC '08.

[5]  Ronitt Rubinfeld,et al.  Testing random variables for independence and identity , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[6]  W. Hoeffding Probability Inequalities for sums of Bounded Random Variables , 1963 .

[7]  Colin McDiarmid,et al.  Surveys in Combinatorics, 1989: On the method of bounded differences , 1989 .

[8]  Shai Ben-David,et al.  Detecting Change in Data Streams , 2004, VLDB.

[9]  Ronitt Rubinfeld,et al.  The complexity of approximating the entropy , 2002, Proceedings 17th IEEE Annual Conference on Computational Complexity.