Coresets for Nonparametric Estimation - the Case of DP-Means

Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets, a data summarization technique originating from computational geometry, for this task. Coresets are weighted subsets of the data such that models trained on them are provably competitive with models trained on the full dataset. Coresets whose size is sublinear in the dataset size allow for fast approximate inference with provable guarantees. Existing constructions, however, are limited to parametric problems. Using novel techniques in coreset construction, we show the existence of coresets for DP-Means, a prototypical nonparametric clustering problem, and provide a practical construction algorithm. We empirically demonstrate that our algorithm allows us to efficiently trade off computation time against approximation error and thus scale DP-Means to large datasets. For instance, with coresets we obtain a computational speedup of 45x at an approximation error of only 2.4% compared to solving on the full dataset. In contrast, for the same subsample size, the "naive" approach of uniformly subsampling the data incurs an approximation error of 22.5%.
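
To make the notion of a weighted subset concrete, the following minimal sketch evaluates a DP-Means style objective on a full dataset and on a uniformly subsampled, reweighted subset (the "naive" baseline mentioned above). It is an illustration, not the construction proposed in this work. The objective form (weighted squared distance to the nearest center plus a penalty `lam` per center) follows the commonly used DP-Means formulation and is assumed here rather than quoted from the abstract; the function names, the toy data, the fixed centers, and the value of `lam` are all hypothetical.

```python
import random


def dp_means_cost(points, weights, centers, lam):
    """Weighted DP-Means style objective (assumed form):
    sum_i w_i * min_c ||x_i - c||^2 + lam * (number of centers)."""
    total = 0.0
    for x, w in zip(points, weights):
        # Squared Euclidean distance from x to its nearest center.
        d2 = min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers)
        total += w * d2
    return total + lam * len(centers)


def uniform_weighted_subset(points, m, seed=0):
    """Naive baseline: m points drawn uniformly without replacement,
    each given weight n/m so the weights sum to the dataset size n."""
    rng = random.Random(seed)
    sample = rng.sample(points, m)
    return sample, [len(points) / m] * m


# Hypothetical toy data: two Gaussian blobs in the plane.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
data += [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(500)]

centers = [(0.0, 0.0), (8.0, 8.0)]  # fixed candidate solution, for illustration
lam = 4.0                           # hypothetical per-cluster penalty

full_cost = dp_means_cost(data, [1.0] * len(data), centers, lam)
subset, weights = uniform_weighted_subset(data, 100)
approx_cost = dp_means_cost(subset, weights, centers, lam)

# Relative error of the subset-based cost for this one candidate solution.
rel_error = abs(approx_cost - full_cost) / full_cost
print(f"full={full_cost:.1f} subset={approx_cost:.1f} rel_error={rel_error:.3%}")
```

A coreset differs from this baseline in how points and weights are chosen: the guarantee is that the weighted cost stays close to the full cost for every candidate set of centers, not just for one fixed solution as checked here.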
