Robust k-means++

A good seeding or initialization of cluster centers for the k-means method is important from both theoretical and practical standpoints. The k-means objective is inherently non-robust and sensitive to outliers. A popular seeding such as k-means++ [3], which is more likely to pick outliers in the worst case, may compound this drawback, thereby affecting the quality of clustering on noisy data. For any 0 < δ ≤ 1, we show that using a mixture of D² [3] and uniform sampling, we can pick O(k/δ) candidate centers with the following guarantee: they contain some k centers that give an O(1)-approximation to the optimal robust k-means solution while discarding at most δn more points than the outliers discarded by the optimal solution. That is, if the optimal solution discards its farthest βn points as outliers, our solution discards its farthest (β + δ)n points as outliers. The constant factor in our O(1)-approximation does not depend on δ. This is an improvement over previous results for k-means with outliers based on LP relaxation and rounding [7] and local search [17]. The O(k/δ)-sized subset of candidate centers can be found in O(ndk) time. Our robust k-means++ is also easily amenable to scalable, faster, parallel implementations of k-means++ [5]. Our empirical results compare the above robust variant of k-means++ with the usual k-means++, uniform random seeding, threshold k-means++ [6], and local search on real-world and synthetic data.

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.
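The abstract describes the seeding only at a high level; the following is a minimal sketch of what a mixed D²/uniform sampling step could look like. The candidate budget ceil(k/δ), the 1/2 mixing weight, and the function name robust_seeding are illustrative assumptions and are not taken from the paper.

import numpy as np

def robust_seeding(X, k, delta, seed=None):
    """Pick roughly O(k/delta) candidate centers by mixing uniform and
    D^2 (k-means++-style) sampling.  Illustrative sketch only: the
    candidate budget and the 1/2 mixing probability are assumptions."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = int(np.ceil(k / delta))                  # assumed O(k/delta) candidate budget

    # First candidate chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    d2 = np.sum((X - centers[0]) ** 2, axis=1)   # squared distance to nearest candidate

    for _ in range(m - 1):
        if rng.random() < 0.5 or d2.sum() == 0:
            # Uniform sampling: unlikely to land on far-away outliers.
            idx = rng.integers(n)
        else:
            # D^2 sampling: probability proportional to squared distance
            # to the nearest candidate picked so far, as in k-means++.
            idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))

    return np.array(centers)

For example, robust_seeding(X, k=5, delta=0.1) returns about 50 candidate points for a data matrix X of shape (n, d); the guarantee stated above concerns the subsequent step of selecting some k of these candidates and discarding the farthest (β + δ)n points as outliers.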

[1] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[2] Samir Khuller, et al. Algorithms for facility location problems with outliers, 2001, SODA '01.

[3] L. Schulman, et al. Universal ε-approximators for integrals, 2010, SODA '10.

[4] Sergei Vassilvitskii, et al. Local Search Methods for k-Means with Outliers, 2017, Proc. VLDB Endow.

[5] Michael Langberg, et al. A unified framework for approximating and clustering data, 2011, STOC.

[6] Teofilo F. Gonzalez, et al. Clustering to Minimize the Maximum Intercluster Distance, 1985, Theor. Comput. Sci.

[7] Alexandros Georgogiannis, et al. Robust k-means: a Theoretical Revisit, 2016, NIPS.

[8] Dan Feldman, et al. A PTAS for k-means clustering based on weak coresets, 2007, SCG '07.

[9] Sergei Vassilvitskii, et al. Scalable K-Means++, 2012, Proc. VLDB Endow.

[10] Ke Chen, et al. A constant factor approximation algorithm for k-median clustering with outliers, 2008, SODA '08.

[11] Aristides Gionis, et al. k-means--: A Unified Approach to Clustering and Outlier Detection, 2013, SDM.

[12] Aditya Bhaskara, et al. Greedy Sampling for Approximate Clustering in the Presence of Outliers, 2019, NeurIPS.

[13] A. Gordaliza, et al. Robustness Properties of k Means and Trimmed k Means, 1999.

[14] Ankit Aggarwal, et al. Adaptive Sampling for k-Means Clustering, 2009, APPROX-RANDOM.

[15] Jakub W. Pachocki, et al. Geometric median in nearly linear time, 2016, STOC.

[16] Nir Ailon, et al. Streaming k-means approximation, 2009, NIPS.

[17] S. P. Lloyd, et al. Least squares quantization in PCM, 1982, IEEE Trans. Inf. Theory.

[18] Andrea Vattani, et al. k-means Requires Exponentially Many Iterations Even in the Plane, 2008, SCG '09.

[19] Mohammad R. Salavatipour, et al. Approximation Schemes for Clustering with Outliers, 2018, SODA.

[20] B. Ripley, et al. Robust Statistics, 2018, Wiley Series in Probability and Statistics.

[21] Shi Li, et al. Constant approximation for k-median and k-means with outliers via iterative rounding, 2017, STOC.

[22] Jun Li, et al. Clustering With Outlier Removal, 2018, IEEE Transactions on Knowledge and Data Engineering.

[23] Andreas Krause, et al. Fast and Provably Good Seedings for k-Means, 2016, NIPS.