Testing Properties of Collections of Distributions

We propose a framework for studying property testing of collections of distributions, where the number of distributions in the collection is a parameter of the problem. Previous work on property testing of distributions considered single distributions or pairs of distributions. We suggest two models that differ in the way the algorithm is given access to samples from the distributions. In one model the algorithm may ask for a sample from any distribution of its choice, and in the other the choice of the distribution is random. Our main focus is on the basic problem of distinguishing between the case that all the distributions in the collection are the same (or very similar) and the case that the distributions must be modified in a non-negligible manner in order to obtain this property. We give almost tight upper and lower bounds for this testing problem, and we also study an extension to a clusterability property. One of our lower bounds directly implies a lower bound on testing independence of a joint distribution, a question that was left open by previous work.
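The two access models described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the paper's formal definition: the class name `DistributionCollection`, the method names `query_sample` and `random_sample`, and the choice of a uniformly random index in the second model are all hypothetical, introduced here for concreteness.

```python
import random


class DistributionCollection:
    """Hypothetical container for m distributions over the domain {0, ..., n-1}."""

    def __init__(self, dists, seed=None):
        # dists: list of m probability vectors, each of length n (assumed valid).
        self.dists = dists
        self.m = len(dists)
        self.n = len(dists[0])
        self.rng = random.Random(seed)

    def _draw(self, i):
        # Draw one domain element according to the i-th distribution.
        return self.rng.choices(range(self.n), weights=self.dists[i], k=1)[0]

    def query_sample(self, i):
        # First model: the algorithm picks the index i itself.
        return self._draw(i)

    def random_sample(self):
        # Second model: the index is chosen at random (uniformly, by assumption
        # in this sketch) and returned together with the sampled element.
        i = self.rng.randrange(self.m)
        return i, self._draw(i)
```

In the first model a tester could, for instance, concentrate its samples on a few chosen distributions, whereas in the second it only sees pairs `(i, x)` and has no control over which index `i` appears.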
