Fast and Provably Good Seedings for k-Means

Seeding, the task of finding initial cluster centers, is critical to obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state-of-the-art algorithm, does not scale well to massive datasets because it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the seeding step of k-means++. However, this result requires assumptions on the data-generating distribution. We propose a simple yet fast seeding algorithm that produces *provably* good clusterings even *without assumptions* on the data. Our analysis shows that the algorithm allows for a favourable trade-off between solution quality and computational cost, speeding up k-means++ seeding by up to several orders of magnitude. We validate our theoretical results in extensive experiments on a variety of real-world datasets.
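
To make the contrast in the abstract concrete, the following is a minimal sketch of the two seeding strategies it mentions: exact k-means++ seeding (D²-sampling, one full pass over the data per center) and a Markov chain Monte Carlo approximation of it. The function names, the `chain_length` parameter, and the uniform proposal distribution are assumptions made for illustration. The MCMC variant follows the general Metropolis-Hastings idea only and is not the algorithm proposed in this paper, whose details the abstract does not give.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeding(points, k, rng):
    """Exact D^2-sampling: each new center needs a full pass over the data."""
    centers = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its closest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    for _ in range(k - 1):
        if sum(d2) > 0:
            # Sample a point with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        else:
            # Degenerate case: every point coincides with a center.
            idx = rng.randrange(len(points))
        centers.append(points[idx])
        # Full pass: refresh the closest-center distances.
        d2 = [min(d, sq_dist(p, points[idx])) for d, p in zip(d2, points)]
    return centers


def mcmc_seeding(points, k, chain_length, rng):
    """Approximate D^2-sampling with a Metropolis-Hastings chain.

    Assumption: the proposal is uniform over the data, so each new center
    costs about chain_length * len(centers) distance evaluations,
    independent of the dataset size.
    """
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        x = rng.choice(points)
        dx = min(sq_dist(x, c) for c in centers)
        for _ in range(chain_length - 1):
            y = rng.choice(points)
            dy = min(sq_dist(y, c) for c in centers)
            # Accept the candidate with probability min(1, dy / dx).
            if dx == 0 or dy / dx > rng.random():
                x, dx = y, dy
        centers.append(x)
    return centers


if __name__ == "__main__":
    data_rng = random.Random(0)
    blobs = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
    data = [(data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
            for cx, cy in blobs for _ in range(50)]
    print(kmeanspp_seeding(data, 3, random.Random(1)))
    print(mcmc_seeding(data, 3, 20, random.Random(1)))
```

In this sketch, a longer chain brings the sampled centers closer in distribution to exact D²-sampling at a higher cost per center, which is one way to picture the trade-off between solution quality and computational cost that the abstract refers to.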
