Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem. We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the squared Euclidean distance, the Mahalanobis distance, the Kullback-Leibler (KL) divergence, and the Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.
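For concreteness, the class of distortion measures named above arises from a single definition: every strictly convex, differentiable function phi generates a Bregman divergence, and the listed distances correspond to specific choices of phi (as in Banerjee et al., "Clustering with Bregman Divergences"). A short LaTeX summary of this standard definition:

```latex
% Bregman divergence generated by a strictly convex, differentiable \phi:
d_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle
% Standard instances:
%   \phi(x) = \lVert x \rVert_2^2        \Rightarrow \text{squared Euclidean distance}
%   \phi(x) = \sum_i x_i \log x_i        \Rightarrow \text{Kullback-Leibler divergence}
%   \phi(x) = -\sum_i \log x_i           \Rightarrow \text{Itakura-Saito distance}
```

To illustrate the flavor of the proposed construction, the sketch below follows the general sensitivity-based importance-sampling recipe that strong coreset constructions of this kind use: seed a rough bicriteria solution, upper-bound each point's sensitivity by its relative contribution to the cost, then sample points proportionally and reweight them to keep cost estimates unbiased. This is a minimal sketch for the squared Euclidean case only, not the paper's exact algorithm; the sensitivity bound is a simplified placeholder and all names are illustrative.

```python
import numpy as np

def bregman_coreset_sketch(X, k, m, seed=None):
    """Illustrative (k, m) coreset via sensitivity-based importance sampling.

    Uses the squared Euclidean distance, i.e. the Bregman divergence
    generated by phi(x) = ||x||^2. Simplified placeholder, not the
    paper's exact sensitivity bounds.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # 1) Bicriteria seeding via D^2-sampling (k-means++ style).
    B = [X[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = np.min([((X - b) ** 2).sum(axis=1) for b in B], axis=0)
        B.append(X[rng.choice(n, p=d2 / d2.sum())])
    B = np.asarray(B)

    # 2) Crude sensitivity upper bounds from distances to the seeds:
    #    each point's share of the total cost, plus a uniform term.
    d2 = np.min(((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2), axis=1)
    s = d2 / d2.sum() + 1.0 / n
    p = s / s.sum()

    # 3) Importance sampling with unbiasedness-preserving weights.
    idx = rng.choice(n, size=m, p=p)
    weights = 1.0 / (m * p[idx])
    return X[idx], weights
```

Models trained on the returned weighted subset then stand in for models trained on the full data set, with the weights ensuring that the estimated clustering cost remains an unbiased estimate of the true cost.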
