Scalable Training of Mixture Models via Coresets

How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk^3/ε^2) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones.
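To make the notion of a weighted subset concrete, the following minimal sketch evaluates the weighted negative log-likelihood of a spherical Gaussian mixture, first on a full data set with unit weights and then on a small weighted subset. It is an illustration only, not the construction analyzed in the paper: the subset here is drawn uniformly at random with weights n/m, which carries no (1 + ε) guarantee, and the function names gmm_weighted_nll and uniform_weighted_subset, the spherical-covariance restriction, and all parameter values are assumptions introduced for this example.

```python
import numpy as np


def gmm_weighted_nll(points, weights, mix, means, variances):
    """Weighted negative log-likelihood of a spherical Gaussian mixture.

    points: (n, d) array, weights: (n,) array, mix: (k,) mixing weights,
    means: (k, d) component means, variances: (k,) per-component variances.
    """
    d = points.shape[1]
    # Squared distance from every point to every component mean: shape (n, k).
    sq = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    # Log of (mixing weight * spherical Gaussian density) per component.
    log_comp = (np.log(mix)[None, :]
                - 0.5 * d * np.log(2.0 * np.pi * variances)[None, :]
                - 0.5 * sq / variances[None, :])
    # Numerically stable log-sum-exp over the k components.
    top = log_comp.max(axis=1, keepdims=True)
    log_p = top[:, 0] + np.log(np.exp(log_comp - top).sum(axis=1))
    return -(weights * log_p).sum()


def uniform_weighted_subset(points, m, rng):
    """Stand-in for a coreset: m uniform samples, each weighted n/m.

    Assumption for illustration; the paper's coresets are built differently.
    """
    n = len(points)
    idx = rng.choice(n, size=m, replace=False)
    return points[idx], np.full(m, n / m)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = np.array([[0.0, 0.0], [5.0, 5.0]])
    # Synthetic data: 10000 points around each of the two means.
    data = np.vstack([rng.normal(mu, 1.0, size=(10000, 2)) for mu in means])
    mix = np.array([0.5, 0.5])
    variances = np.array([1.0, 1.0])

    full = gmm_weighted_nll(data, np.ones(len(data)), mix, means, variances)
    subset, w = uniform_weighted_subset(data, 2000, rng)
    approx = gmm_weighted_nll(subset, w, mix, means, variances)
    print("relative error:", abs(approx - full) / full)
```

The weights n/m make the subset's total weight equal to n, so the weighted cost is an unbiased estimate of the full cost for any fixed model. A coreset in the sense of the abstract is stronger: its approximation must hold for every candidate model simultaneously, which is what lets a model fit on the coreset be trusted on the original data.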
