Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction

We study the problem of $k$-center clustering with outliers in arbitrary metrics and Euclidean space. Though a number of methods have been developed in the past decades, it is still quite challenging to design a quality-guaranteed algorithm with low complexity for this problem. Our idea is inspired by Gonzalez's algorithm, the greedy method for solving the ordinary $k$-center clustering problem. Based on some novel observations, we show that this greedy strategy can in fact handle $k$-center clustering with outliers efficiently, in terms of both clustering quality and time complexity. We further show that the greedy approach yields a small coreset for the problem in doubling metrics, which reduces the time complexity significantly. Our algorithms are easy to implement in practice. We test our method on both synthetic and real datasets. The experimental results suggest that our algorithms can achieve near-optimal solutions and yield lower running times compared with existing methods.
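For readers unfamiliar with the greedy method mentioned above, the following is a minimal sketch of the classical Gonzalez farthest-first traversal for ordinary $k$-center, together with a helper that evaluates the outlier objective (the clustering radius after discarding the $z$ farthest points). It is not the outlier-aware variant developed in this paper; the function names `gonzalez_k_center` and `radius_with_outliers`, the `seed` parameter, and the use of Euclidean distance are assumptions made only for illustration.

```python
import math
import random


def gonzalez_k_center(points, k, seed=0):
    """Classical farthest-first traversal (ordinary k-center, no outliers).

    Returns the indices of the k chosen centers and the resulting radius.
    Illustrative sketch only: Euclidean distance is assumed.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    first = rng.randrange(len(points))  # arbitrary starting center
    centers = [first]
    # dist[i] = distance from point i to its nearest chosen center so far
    dist = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # Greedy step: pick the point farthest from the current centers.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
    return centers, max(dist)


def radius_with_outliers(points, centers, z):
    """Radius of a solution when the z farthest points are discarded."""
    dists = sorted(
        min(math.dist(p, points[c]) for c in centers) for p in points
    )
    return dists[len(dists) - 1 - z]


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    chosen, radius = gonzalez_k_center(pts, k=2)
    print(chosen, radius)
```

The traversal costs $O(nk)$ distance evaluations, since each new center triggers one pass over the points. The plain version is sensitive to outliers because the farthest point is often an outlier itself, which is the difficulty the paper's observations are meant to address.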
