Paper Information - Fast and Provably Good Seedings for k-Means

Seeding - the task of finding initial cluster centers - is critical in obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state-of-the-art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data-generating distribution. We propose a simple yet fast seeding algorithm that produces *provably* good clusterings even *without assumptions* on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude. We validate our theoretical results in extensive experiments on a variety of real-world data sets.
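To make the scaling concern concrete, the following is a minimal sketch of standard k-means++ seeding (D²-sampling), the baseline the abstract refers to. It is not the algorithm proposed in the paper. The function names `sq_dist` and `kmeanspp_seeding` and the tuple-based point representation are illustrative assumptions. The sketch shows why the procedure is sequential: each new center requires a full pass over the data to refresh the squared distances before the next center can be drawn.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeding(points, k, rng=None):
    """Illustrative k-means++ seeding (hypothetical helper, not from the paper)."""
    rng = rng or random.Random()
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    # min_d2[i] holds the squared distance from points[i] to its closest center.
    min_d2 = [float("inf")] * len(points)
    for _ in range(k - 1):
        last = centers[-1]
        # Full pass over the data: this per-center pass is the scaling bottleneck.
        min_d2 = [min(d, sq_dist(p, last)) for d, p in zip(min_d2, points)]
        total = sum(min_d2)
        if total == 0.0:
            # Every point coincides with a center; fall back to uniform sampling.
            nxt = rng.choice(points)
        else:
            # D^2-sampling: pick a point with probability proportional to min_d2.
            nxt = rng.choices(points, weights=min_d2, k=1)[0]
        centers.append(nxt)
    return centers
```

A call such as `kmeanspp_seeding(points, 3, random.Random(0))` returns three data points to use as initial centers. Because each draw depends on the distances updated in the previous pass, the passes cannot be run independently of one another.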
