Testing Mixtures of Discrete Distributions

The sample complexity of testing properties of distributions over large domains has been studied extensively. For many properties, the sample complexity is known to be substantially smaller than the domain size. For example, over a domain of size $n$, distinguishing the uniform distribution from distributions that are far from uniform in $\ell_1$-distance requires only $O(\sqrt{n})$ samples. However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish whether the samples come from a distribution that is $\epsilon$-close to uniform or from one that is $(1-\epsilon)$-far from uniform. The latter task requires a number of samples that is nearly linear in $n$ [Valiant 2008, Valiant and Valiant 2011]. In this work, we present a noise model that on the one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known a priori. Focusing on the identity and closeness testing problems leads to the following mixture testing question: given samples from distributions $p, q_1, q_2$, can we test whether $p$ is a mixture of $q_1$ and $q_2$? We consider this general question in various scenarios that differ in how the tester can access the distributions, and show that this problem is indeed more tractable. Our results show that the sample complexity of our testers is exactly the same as in the classical non-mixture case.
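
To make the mixture testing question concrete, the following is a minimal sketch of the property being tested, not of the testers from this work. It assumes that "mixture" means $p = \alpha q_1 + (1-\alpha) q_2$ for some unknown weight $\alpha \in [0,1]$ (the abstract does not fix this), and that all three distributions are given explicitly as probability vectors rather than through samples. The function names are hypothetical. Since the $\ell_1$-distance from $p$ to the mixture is convex and piecewise linear in $\alpha$, its minimum over $[0,1]$ is attained at an endpoint or at a breakpoint, so it suffices to check those candidates.

```python
from typing import List, Sequence, Tuple


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """l1-distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mixture(q1: Sequence[float], q2: Sequence[float], alpha: float) -> List[float]:
    """The distribution alpha * q1 + (1 - alpha) * q2."""
    return [alpha * x + (1 - alpha) * y for x, y in zip(q1, q2)]


def closest_mixture(
    p: Sequence[float], q1: Sequence[float], q2: Sequence[float]
) -> Tuple[float, float]:
    """Return (alpha, distance) minimizing the l1-distance from p to a mixture.

    Assumption (not from the source): the mixing weight alpha is unknown
    and ranges over [0, 1].
    """
    # Endpoints of the interval are always candidates.
    candidates = [0.0, 1.0]
    # Each coordinate contributes one breakpoint where its term changes sign.
    for pi, ai, bi in zip(p, q1, q2):
        if ai != bi:
            alpha = (pi - bi) / (ai - bi)
            if 0.0 <= alpha <= 1.0:
                candidates.append(alpha)
    best = min(candidates, key=lambda a: l1_distance(p, mixture(q1, q2, a)))
    return best, l1_distance(p, mixture(q1, q2, best))


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    point_mass = [1.0, 0.0, 0.0, 0.0]
    noisy = mixture(uniform, point_mass, 0.5)
    print(closest_mixture(noisy, uniform, point_mass))
```

A distance of zero means $p$ is exactly a mixture of $q_1$ and $q_2$; a tester in the sense of the abstract would instead have to separate this case from a large distance using only samples.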
