Settling the Sample Complexity for Learning Mixtures of Gaussians

We prove that $\widetilde{\Theta}(k d^2 / \varepsilon^2)$ samples are necessary and sufficient for learning a mixture of $k$ Gaussians in $\mathbf{R}^d$, up to error $\varepsilon$ in total variation distance. This improves both the best known upper bound and the best known lower bound for this problem. For mixtures of axis-aligned Gaussians, we show that $\widetilde{O}(k d / \varepsilon^2)$ samples suffice, matching a known lower bound. Moreover, these results hold in the agnostic learning setting as well. The upper bound is based on a novel technique for distribution learning built on a notion of sample compression: any class of distributions that admits such a sample compression scheme can be learned with few samples, and if a class of distributions admits a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in $\mathbf{R}^d$ admits an efficient sample compression scheme.
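
To make the compression notion concrete, here is a hedged paraphrase of the kind of definition involved; the parameter names $\tau$, $t$, $m$ and the $2/3$ success probability are illustrative assumptions rather than quotations from the paper.

```latex
% Sketch of a sample compression scheme for distribution learning.
% Parameter names and the 2/3 constant are illustrative assumptions.
A class $\mathcal{F}$ of distributions admits $(\tau, t, m)$-compression
if there is a decoder $\mathcal{J}$ such that for every $f \in \mathcal{F}$
and every $\varepsilon > 0$: given $m(\varepsilon)$ i.i.d.\ samples from $f$,
with probability at least $2/3$ some subsequence of at most
$\tau(\varepsilon)$ of the samples, together with at most $t(\varepsilon)$
extra bits, is mapped by $\mathcal{J}$ to a distribution $\hat{f}$
satisfying $d_{\mathrm{TV}}(f, \hat{f}) \le \varepsilon$.
```

The sketch below is a minimal, runnable toy of the encode/decode round trip for a single Gaussian, assuming numpy. The `encode` and `decode` helpers are hypothetical names: the encoder here merely keeps a prefix of the samples and the decoder refits by empirical moments, whereas the paper's actual scheme selects the retained points far more carefully and attaches discretized side information.

```python
# Toy sketch of "compress a Gaussian into a few of its own samples".
# NOT the paper's encoder: keeping a sample prefix and refitting by
# empirical moments only illustrates the encode/decode interface.
import numpy as np

def encode(samples: np.ndarray, tau: int) -> np.ndarray:
    """Encoder (hypothetical): keep only the first tau sample points."""
    return samples[:tau]

def decode(kept: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Decoder (hypothetical): refit a Gaussian by empirical moments."""
    mu_hat = kept.mean(axis=0)
    sigma_hat = np.cov(kept, rowvar=False)
    return mu_hat, sigma_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 5
    true_mu = rng.normal(size=d)
    A = rng.normal(size=(d, d))
    true_sigma = A @ A.T + np.eye(d)          # well-conditioned covariance

    samples = rng.multivariate_normal(true_mu, true_sigma, size=10_000)
    kept = encode(samples, tau=50 * d * d)    # ~O(d^2) points retained
    mu_hat, sigma_hat = decode(kept)

    print("mean error:", np.linalg.norm(mu_hat - true_mu))
    print("cov error :", np.linalg.norm(sigma_hat - true_sigma, ord="fro"))
```

A Gaussian in $\mathbf{R}^d$ is determined by $\Theta(d^2)$ parameters, which is loosely why the retained-point budget above scales like $d^2$; that same $d^2$ is what reappears, multiplied by $k$ for mixtures, in the $\widetilde{\Theta}(k d^2 / \varepsilon^2)$ bound.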
