Noisy, Greedy and Not So Greedy k-means++

The k-means++ algorithm due to Arthur and Vassilvitskii has become the most popular seeding method for Lloyd's algorithm. It samples the first center uniformly at random from the data set and the remaining $k-1$ centers iteratively by $D^2$-sampling, where the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of $O(\log k)$ in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation, called the greedy k-means++ algorithm, in which in each iteration multiple candidate centers are sampled by $D^2$-sampling and only the one that decreases the objective the most is chosen as the center for that iteration. They stated it as an open question whether this variant also yields an $O(\log k)$-approximation (or an even better one). We show that this is not the case by presenting a family of instances on which greedy k-means++ yields only an $\Omega(\ell\cdot \log k)$-approximation in expectation, where $\ell$ is the number of candidate centers sampled in each iteration. We also study a variation, which we call the noisy k-means++ algorithm, in which only one center is sampled in every iteration, but no longer exactly by $D^2$-sampling: in each iteration an adversary is allowed to rescale the probability arising from $D^2$-sampling individually for each point by a factor between $1-\epsilon_1$ and $1+\epsilon_2$, for parameters $\epsilon_1 \in [0,1)$ and $\epsilon_2 \ge 0$. We prove that noisy k-means++ computes an $O(\log^2 k)$-approximation in expectation. We also discuss some applications of this result.
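
To make the sampling schemes concrete, here is a minimal NumPy sketch of the seeding step. It is not the paper's code; the names `kmeanspp_seed`, `num_candidates`, and `perturb` are illustrative choices. With `num_candidates=1` and `perturb=None` it reduces to standard k-means++; with `num_candidates` set to $\ell$ it implements the greedy rule of keeping the sampled candidate that decreases the cost the most.

```python
import numpy as np

def kmeanspp_seed(X, k, num_candidates=1, perturb=None, rng=None):
    """Seed k centers from the rows of X.

    num_candidates=1   -> standard k-means++ (pure D^2-sampling).
    num_candidates=ell -> greedy k-means++: ell candidates are drawn by
                          D^2-sampling and the one that decreases the
                          clustering cost the most is kept.
    perturb            -> optional map from the D^2 probabilities to
                          nonnegative weights (noisy k-means++).
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # first center: uniform at random
    d2 = ((X - centers[0]) ** 2).sum(axis=1)       # squared distance to closest center
    for _ in range(k - 1):
        p = d2 / d2.sum()                          # D^2-sampling distribution
        if perturb is not None:
            w = perturb(p)                         # per-point reweighting
            p = w / w.sum()
        candidates = rng.choice(n, size=num_candidates, p=p)
        # Greedy rule: keep the candidate whose selection yields the smallest
        # total cost (for num_candidates=1 this is a no-op).
        best = min(candidates,
                   key=lambda i: np.minimum(d2, ((X - X[i]) ** 2).sum(axis=1)).sum())
        centers.append(X[best])
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(axis=1))
    return np.array(centers)
```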
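
In the noisy model, each point's $D^2$-probability may be rescaled by an individual factor in $[1-\epsilon_1, 1+\epsilon_2]$ before renormalization. The paper's adversary may choose these factors; the snippet below plugs a random (not adversarial) perturbation of this kind into the sketch above, with illustrative values for `eps1` and `eps2`.

```python
rng = np.random.default_rng(0)
eps1, eps2 = 0.2, 0.3                    # illustrative parameters
# Scale each probability by an independent factor in [1 - eps1, 1 + eps2];
# in the paper these factors are chosen adversarially, here at random.
perturb = lambda p: p * rng.uniform(1 - eps1, 1 + eps2, size=p.shape)

X = rng.standard_normal((500, 2))        # toy data set
centers = kmeanspp_seed(X, k=10, num_candidates=3, perturb=perturb)
```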

[1] Matus Telgarsky et al. Greedy bi-criteria approximations for k-medians and k-means, 2016, arXiv.

[2] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding, 2007, SODA '07.

[3] Dennis Wei. A Constant-Factor Bi-Criteria Approximation Guarantee for k-means++, 2016, NIPS.

[4] Ankit Aggarwal et al. Adaptive Sampling for k-Means Clustering, 2009, APPROX-RANDOM.

[5] Ravishankar Krishnaswamy et al. The Hardness of Approximation of Euclidean k-Means, 2015, SoCG.

[6] Michael Langberg et al. A unified framework for approximating and clustering data, 2011, STOC.

[7] Heiko Röglin et al. A bad instance for k-means++, 2013, Theor. Comput. Sci.

[8] Philip S. Yu et al. Top 10 algorithms in data mining, 2007, Knowledge and Information Systems.

[9] David M. Mount et al. A local search approximation algorithm for k-means clustering, 2002, SCG '02.

[10] Maria-Florina Balcan et al. Approximate clustering without the approximation, 2009, SODA.

[11] Meena Mahajan et al. The planar k-means problem is NP-hard, 2012, Theor. Comput. Sci.

[12] Euiwoong Lee et al. Improved and simplified inapproximability for k-means, 2015, Inf. Process. Lett.

[13] Mohammad R. Salavatipour et al. Local Search Yields a PTAS for k-Means in Doubling Metrics, 2016, FOCS.

[14] Václav Rozhon et al. K-means++: Few More Steps Yield Constant Approximation, 2020, ICML.

[15] Sergei Vassilvitskii et al. K-means: algorithms, analyses, experiments, 2007.

[16] Pierre Hansen et al. NP-hardness of Euclidean sum-of-squares clustering, 2008, Machine Learning.

[17] Patricio A. Vela et al. A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm, 2012, Expert Syst. Appl.

[18] S. P. Lloyd. Least squares quantization in PCM, 1982, IEEE Trans. Inf. Theory.

[19] Silvio Lattanzi et al. A Better k-means++ Algorithm via Local Search, 2019, ICML.

[20] Ola Svensson et al. Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms, 2017, FOCS.

[21] Philip N. Klein et al. Local Search Yields Approximation Schemes for k-Means and k-Median in Euclidean and Minor-Free Metrics, 2016, FOCS.

[22] Nir Ailon et al. Tight lower bound instances for k-means++ in two dimensions, 2016, Theor. Comput. Sci.