Approximate K-Means++ in Sublinear Time

The quality of k-means clustering is highly sensitive to the choice of initial centers. The classic remedy is to apply k-means++ to obtain an initial set of centers that is provably competitive with the optimal solution. Unfortunately, k-means++ requires k full passes over the data, which limits its applicability to massive datasets. We address this problem by proposing a simple and efficient seeding algorithm for k-means clustering. The main idea is to replace the exact D²-sampling step in k-means++ with a substantially faster approximation based on Markov chain Monte Carlo sampling. We prove that, under natural assumptions on the data, the proposed algorithm retains the full theoretical guarantees of k-means++ while its computational complexity is only sublinear in the number of data points. For such datasets, a provably good clustering can thus be obtained in sublinear time. Extensive experiments confirm that the proposed method is competitive with k-means++ on a variety of real-world, large-scale datasets while reducing the runtime by several orders of magnitude.
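The abstract does not spell out the sampler, so the sketch below shows one natural way the idea could look in code: the exact D²-sampling step of k-means++ is replaced by a short independence Metropolis chain with a uniform proposal, whose stationary distribution is the D² distribution over the data. The function name mcmc_seeding and the chain_length parameter are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np


def mcmc_seeding(X, k, chain_length=200, rng=None):
    """Hedged sketch of k-means++-style seeding where D^2-sampling is
    approximated by a Metropolis chain (assumed uniform proposal).

    X            : (n, d) data matrix
    k            : number of centers to select
    chain_length : Metropolis steps per center (illustrative parameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]

    def d2(x, centers):
        # Squared Euclidean distance from x to its closest current center.
        C = np.asarray(centers)
        return float(np.min(np.sum((C - x) ** 2, axis=1)))

    # First center: uniform at random, exactly as in k-means++.
    centers = [X[rng.integers(n)]]

    for _ in range(1, k):
        # Start the chain at a uniformly sampled point.
        x = X[rng.integers(n)]
        dx = d2(x, centers)
        for _ in range(chain_length - 1):
            # Uniform proposal; Metropolis acceptance ratio d2(y)/d2(x)
            # makes the D^2 distribution the chain's stationary law.
            y = X[rng.integers(n)]
            dy = d2(y, centers)
            if dx == 0.0 or rng.random() < dy / dx:
                x, dx = y, dy
        # The final chain state serves as the next center.
        centers.append(x)

    return np.asarray(centers)
```

Under these assumptions, each Metropolis step touches only a single data point, so seeding costs on the order of k²·m·d distance evaluations for chain length m, independent of the number of points n, which is where the sublinear runtime comes from. The returned seeds can then be refined with any Lloyd-type k-means solver as its initial centers.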
