Testing that distributions are close

Given two distributions over an $n$-element set, we wish to check whether these distributions are statistically close by only sampling. We give a sublinear algorithm which uses $O(n^{2/3}\epsilon^{-4}\log n)$ independent samples from each distribution, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small (less than $\max\left(\frac{\epsilon^{2}}{32\sqrt[3]{n}}, \frac{\epsilon}{4\sqrt{n}}\right)$) or large (more than $\epsilon$) in $L_1$-distance. We also give an $\Omega(n^{2/3}\epsilon^{-2/3})$ lower bound. Our algorithm has applications to the problem of checking whether a given Markov process is rapidly mixing. We develop sublinear algorithms for this problem as well.
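
To make the two cases in the abstract concrete, the following minimal Python sketch evaluates the "small" cutoff, the "large" cutoff, and the scale of the sample bound for given $n$ and $\epsilon$. It is only an illustration of the stated quantities, not the sampling algorithm itself: it computes the exact $L_1$ distance from explicitly given probability vectors, which the tester never has access to. The function names are hypothetical, and the hidden constant in the $O(\cdot)$ bound is assumed to be 1 with a natural logarithm.

```python
import math


def l1_distance(p, q):
    """Exact L1 distance between two distributions given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q))


def small_distance_threshold(n, eps):
    """'Small' cutoff from the abstract: max(eps^2 / (32 n^(1/3)), eps / (4 sqrt(n)))."""
    return max(eps ** 2 / (32 * n ** (1 / 3)), eps / (4 * math.sqrt(n)))


def sample_size_scale(n, eps):
    """n^(2/3) * eps^-4 * log n.

    Assumption: the constant hidden by O() is taken to be 1 and the log is natural;
    the abstract does not specify either.
    """
    return n ** (2 / 3) * eps ** -4 * math.log(n)


def classify(p, q, eps):
    """Report which case of the promise problem a pair of known distributions falls in.

    Returns "small", "large", or "gap" (neither case; no guarantee is claimed there).
    """
    n = len(p)
    d = l1_distance(p, q)
    if d < small_distance_threshold(n, eps):
        return "small"
    if d > eps:
        return "large"
    return "gap"


if __name__ == "__main__":
    uniform = [0.25] * 4
    print(classify(uniform, uniform, 0.5))
    print(classify(uniform, [0.5, 0.5, 0.0, 0.0], 0.5))
    print(sample_size_scale(10 ** 6, 0.5))
```

Pairs whose distance falls between the two cutoffs land in the "gap" branch, where the abstract makes no claim about the tester's output.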
