Approximate K-Means++ in Sublinear Time

The quality of k-means clustering is highly sensitive to the choice of initial centers. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data, which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for k-means clustering. The main idea is to replace the exact D²-sampling step in k-means++ with a substantially faster approximation based on Markov chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, a provably good clustering can thus be obtained in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while reducing the runtime by several orders of magnitude.
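The abstract does not spell out the sampler, so the sketch below shows one natural way the idea could look in code: the exact D²-sampling step of k-means++ is replaced by a short independence Metropolis chain with a uniform proposal, whose stationary distribution is the D² distribution over the data. The function name mcmc_seeding and the chain_length parameter are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np


def mcmc_seeding(X, k, chain_length=200, rng=None):
    """Hedged sketch of k-means++-style seeding where D^2-sampling is
    approximated by a Metropolis chain (assumed uniform proposal).

    X            : (n, d) data matrix
    k            : number of centers to select
    chain_length : Metropolis steps per center (illustrative parameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]

    def d2(x, centers):
        # Squared Euclidean distance from x to its closest current center.
        C = np.asarray(centers)
        return float(np.min(np.sum((C - x) ** 2, axis=1)))

    # First center: uniform at random, exactly as in k-means++.
    centers = [X[rng.integers(n)]]

    for _ in range(1, k):
        # Start the chain at a uniformly sampled point.
        x = X[rng.integers(n)]
        dx = d2(x, centers)
        for _ in range(chain_length - 1):
            # Uniform proposal; Metropolis acceptance ratio d2(y)/d2(x)
            # makes the D^2 distribution the chain's stationary law.
            y = X[rng.integers(n)]
            dy = d2(y, centers)
            if dx == 0.0 or rng.random() < dy / dx:
                x, dx = y, dy
        # The final chain state serves as the next center.
        centers.append(x)

    return np.asarray(centers)
```

Under these assumptions, each Metropolis step touches only a single data point, so seeding costs on the order of k²·m·d distance evaluations for chain length m, independent of the number of points n, which is where the sublinear runtime comes from. The returned seeds can then be refined with any Lloyd-type k-means solver as its initial centers.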
