Training Gaussian Mixture Models at Scale via Coresets

How can we train a statistical mixture model on a massive data set? In this work we show how to construct coresets for mixtures of Gaussians. A coreset is a weighted subset of the data that guarantees that models fitting the coreset also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets whose size is polynomial in the dimension and the number of mixture components and independent of the data set size. Hence, one can run computationally intensive algorithms on a significantly smaller data set and still obtain a good approximation for the full data. More importantly, such coresets can be constructed efficiently in both distributed and streaming settings, and they impose no restrictions on the data-generating process. Our results rely on a novel reduction of statistical estimation to problems in computational geometry and on new combinatorial complexity results for mixtures of Gaussians. Empirical evaluation on several real-world data sets suggests that our coreset-based approach significantly reduces training time with negligible approximation error.
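To make the notion of a weighted subset concrete, the following is a minimal sketch of importance sampling with inverse-probability weights. It is not the construction analyzed in this work: the function name `importance_sampled_coreset`, the sampling distribution (a uniform term mixed with squared distance to the data mean), and the synthetic data are all assumptions chosen for illustration, and the sketch carries none of the guarantees stated above.

```python
import numpy as np


def importance_sampled_coreset(X, m, rng):
    """Return (points, weights): a weighted sample of m rows of X.

    Hypothetical helper for illustration only. The sampling distribution
    below is an assumed stand-in, not the one derived for Gaussian mixtures.
    """
    n = X.shape[0]
    # Squared distance of every point to the data mean.
    sq_dist = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    total = sq_dist.sum()
    if total > 0:
        # Half the probability mass is uniform, half favors far-away points.
        q = 0.5 / n + 0.5 * sq_dist / total
    else:
        # All points coincide, so fall back to uniform sampling.
        q = np.full(n, 1.0 / n)
    # Sample m indices with replacement according to q.
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Inverse-probability weights make weighted sums unbiased estimates
    # of the corresponding sums over the full data set.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights


# Synthetic two-cluster data (assumed, for demonstration only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-3.0, 1.0, size=(5000, 2)),
    rng.normal(3.0, 1.0, size=(5000, 2)),
])
C, w = importance_sampled_coreset(X, m=500, rng=rng)
```

The weighted points `C`, `w` could then be handed to any mixture-fitting routine that accepts per-sample weights, for example a weighted EM implementation, in place of the full array `X`.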
