Learning and testing junta distributions over hypercubes

Many tasks related to the analysis of high-dimensional datasets can be formalized as problems involving learning or testing properties of distributions over a high-dimensional domain. In this work, we initiate the study of the following general question: when many of the dimensions of the distribution correspond to "irrelevant" features in the associated dataset, can we learn the distribution efficiently? We formalize this question with the notion of a junta distribution. A distribution $D$ over $\{0, 1\}^n$ is a $k$-junta distribution if the probability mass function $p$ of $D$ is a $k$-junta, i.e., if there is a set $J \subseteq [n]$ of at most $k$ coordinates such that for every $x \in \{0, 1\}^n$, the value of $p(x)$ is completely determined by the value of $x$ on the coordinates in $J$. We show that it is possible to learn $k$-junta distributions with a number of samples that depends only logarithmically on the total number $n$ of dimensions. We give two proofs of this result: one using the cover method and one by developing a Fourier-based learning algorithm inspired by the Low-Degree Algorithm of Linial, Mansour, and Nisan (1993). We also consider the problem of testing whether an unknown distribution is a $k$-junta distribution. We introduce an algorithm for this task with sample complexity $\tilde{O}(2^{n/2} k^4)$ and show that this bound is nearly optimal for constant values of $k$. As a byproduct of the analysis of the algorithm, we obtain an optimal bound on the number of samples required to test a weighted collection of distributions for uniformity. Finally, we establish the sample complexity for learning and testing other classes of distributions related to junta distributions. Notably, we show that the task of testing whether a distribution on $\{0, 1\}^n$ contains a coordinate $i \in [n]$ such that $x_i$ is drawn independently from the remaining coordinates requires $\Omega(2^{2n/3})$ samples.
This is in contrast to the task of testing whether all of the coordinates are drawn independently from each other, which was recently shown to have sample complexity $\Theta(2^{n/2})$ by Acharya, Daskalakis, and Kamath (2015).

Thesis Supervisor: Ronitt Rubinfeld
Title: Professor of Electrical Engineering and Computer Science
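
To make the definition of a junta distribution in the abstract concrete, the following minimal sketch checks by brute force whether an explicitly given probability mass function over the hypercube depends only on a small set of coordinates. It illustrates the definition only and is not the sample-based learning or testing algorithm of the thesis: it assumes the full probability mass function is known, takes time exponential in n, and uses 0-indexed coordinates. The names `is_junta_on`, `find_junta_set`, and `example_pmf` are hypothetical helpers introduced here.

```python
from itertools import combinations, product


def is_junta_on(pmf, n, J, tol=1e-12):
    """Return True if pmf[x] depends only on the coordinates of x in J."""
    seen = {}
    for x in product((0, 1), repeat=n):
        key = tuple(x[i] for i in J)  # restriction of x to J
        p = pmf[x]
        if key in seen:
            # Two points agreeing on J must have the same probability.
            if abs(seen[key] - p) > tol:
                return False
        else:
            seen[key] = p
    return True


def find_junta_set(pmf, n, k):
    """Brute-force search for a set J of at most k coordinates.

    Returns the first such J found (smallest size first), or None.
    """
    for size in range(k + 1):
        for J in combinations(range(n), size):
            if is_junta_on(pmf, n, J):
                return J
    return None


def example_pmf(n=4):
    """Assumed toy example: coordinate 0 is biased, the others are uniform."""
    pmf = {}
    for x in product((0, 1), repeat=n):
        bias = 0.75 if x[0] == 1 else 0.25
        pmf[x] = bias / 2 ** (n - 1)
    return pmf


if __name__ == "__main__":
    pmf = example_pmf(4)
    print(find_junta_set(pmf, 4, 1))
```

In this toy example the probability of a point is fixed by its first coordinate alone, so the distribution is a 1-junta distribution but not a 0-junta distribution (it is not uniform).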

[1]  Dana Ron,et al.  Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[2]  Ilias Diakonikolas,et al.  Optimal Algorithms for Testing Closeness of Discrete Distributions , 2013, SODA.

[3]  Rocco A. Servedio,et al.  Quantum Algorithms for Learning and Testing Juntas , 2007, Quantum Inf. Process..

[4]  Ronitt Rubinfeld,et al.  Testing that distributions are close , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[5]  Clément L. Canonne,et al.  A Survey on Distribution Testing: Your Data is Big. But is it Blue? , 2020, Electron. Colloquium Comput. Complex..

[6]  Ronitt Rubinfeld,et al.  On the learnability of discrete distributions , 1994, STOC '94.

[7]  N. Littlestone,et al.  Learning in the presence of finitely or infinitely many irrelevant attributes , 1991, COLT '91.

[8]  Rocco A. Servedio,et al.  Adaptivity Helps for Testing Juntas , 2015, CCC.

[9]  Harald Niederreiter,et al.  Probability and computing: randomized algorithms and probabilistic analysis , 2006, Math. Comput..

[10]  Noam Nisan,et al.  Constant depth circuits, Fourier transform, and learnability , 1993, JACM.

[11]  Ariel D. Procaccia,et al.  Junta Distributions and the Average-Case Complexity of Manipulating Elections , 2007, J. Artif. Intell. Res..

[12]  Richard S. Johannes,et al.  Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus , 1988 .

[13]  Dana Ron,et al.  On Testing Expansion in Bounded-Degree Graphs , 2000, Studies in Complexity and Cryptography.

[14]  Eldar Fischer,et al.  Junto-Symmetric Functions, Hypergraph Isomorphism and Crunching , 2012, 2012 IEEE 27th Conference on Computational Complexity.

[15]  Ronitt Rubinfeld,et al.  Testing Closeness of Discrete Distributions , 2010, JACM.

[16]  Daniel M. Kane,et al.  Testing Identity of Structured Distributions , 2014, SODA.

[17]  Eric Blais.  Testing juntas nearly optimally , 2009, STOC '09.

[18]  Sourav Chakraborty,et al.  Efficient Sample Extractors for Juntas with Applications , 2011, ICALP.

[19]  Eric Blais.  Improved Bounds for Testing Juntas , 2008, APPROX-RANDOM.

[20]  Gregory Valiant,et al.  Finding Correlations in Subquadratic Time, with Applications to Learning Parities and Juntas , 2012, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[21]  Yuichi Yoshida,et al.  Partially Symmetric Functions Are Efficiently Isomorphism Testable , 2015, SIAM J. Comput..

[22]  Andris Ambainis,et al.  Efficient Quantum Algorithms for (Gapped) Group Testing and Junta Testing , 2015, SODA.

[23]  Paul Valiant.  Testing symmetric properties of distributions , 2008, STOC '08.

[24]  Maria-Florina Balcan,et al.  Active Property Testing , 2011, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[25]  Ryan O'Donnell,et al.  Learning functions of k relevant variables , 2004, J. Comput. Syst. Sci..

[26]  Rocco A. Servedio,et al.  Testing for Concise Representations , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[27]  Ryan O'Donnell,et al.  Analysis of Boolean Functions , 2014, ArXiv.

[28]  Guy Kindler,et al.  Testing juntas , 2002, J. Comput. Syst. Sci..

[29]  M. F.,et al.  Bibliography , 1985, Experimental Gerontology.

[30]  Rocco A. Servedio,et al.  Learning k-Modal Distributions via Testing , 2012, Theory Comput..

[31]  Ronitt Rubinfeld,et al.  Testing random variables for independence and identity , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[32]  Avrim Blum,et al.  Relevant Examples and Relevant Features: Thoughts from Computational Learning Theory , 1994 .

[33]  Hana Chockler,et al.  A lower bound for testing juntas , 2004, Inf. Process. Lett..

[34]  Pat Langley,et al.  Selection of Relevant Features and Examples in Machine Learning , 1997, Artif. Intell..

[35]  Jason H. Moore,et al.  Chapter 11: Genome-Wide Association Studies , 2012, PLoS Comput. Biol..

[36]  Liam Paninski,et al.  A Coincidence-Based Test for Uniformity Given Very Sparsely Sampled Discrete Data , 2008, IEEE Transactions on Information Theory.

[37]  Ronitt Rubinfeld,et al.  Testing monotonicity of distributions over general partial orders , 2011, ICS.

[38]  Ronitt Rubinfeld,et al.  Testing Properties of Collections of Distributions , 2013, Theory Comput..

[39]  Ronitt Rubinfeld,et al.  The complexity of approximating the entropy , 2002, Proceedings 17th IEEE Annual Conference on Computational Complexity.

[40]  Constantinos Daskalakis,et al.  Optimal Testing for Properties of Distributions , 2015, NIPS.

[41]  Gregory Valiant,et al.  Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs , 2011, STOC '11.