Min-Sum Clustering (with Outliers)

We give a constant-factor, polynomial-time pseudo-approximation algorithm for min-sum clustering with or without outliers. The algorithm is allowed to exclude an arbitrarily small constant fraction of the points. For instance, we show how to compute a solution that clusters 98\% of the input data points and costs no more than a constant factor times the optimal solution that clusters 99\% of the input data points. More generally, we give the following bicriteria approximation: for any $\varepsilon > 0$, any instance with $n$ input points, and any positive integer $n' \le n$, we compute in polynomial time a clustering of at least $(1-\varepsilon) n'$ points whose cost is at most a constant factor greater than the optimal cost of clustering $n'$ points. The approximation guarantee grows with $\frac{1}{\varepsilon}$. Our results apply to instances of points in real space endowed with squared Euclidean distance, as well as to points in a metric space. In both settings the number of clusters, and the dimension where relevant, is arbitrary (part of the input, not an absolute constant).
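To make the bicriteria statement concrete, the following is a minimal Python sketch of the objective and of the guarantee being claimed. It is not the algorithm of the paper. It assumes the standard min-sum objective, namely the sum of pairwise distances within each cluster, with each unordered pair counted once (some formulations count ordered pairs, which doubles the value). The names `min_sum_cost`, `meets_bicriteria`, and `factor`, and the convention of marking outliers with `None`, are illustrative choices rather than anything taken from the source.

```python
from itertools import combinations


def sq_euclidean(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def min_sum_cost(points, labels, dist=sq_euclidean):
    """Sum of intra-cluster pairwise distances (assumed objective).

    labels[i] is a cluster id, or None if point i is excluded as an outlier.
    Each unordered pair inside a cluster is counted once.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        if lab is not None:
            clusters.setdefault(lab, []).append(p)
    return sum(
        dist(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )


def meets_bicriteria(labels, n_prime, eps, cost, opt_cost, factor):
    """Check both criteria: enough points clustered, and cost within factor.

    opt_cost stands for the optimal cost of clustering n_prime points; in
    practice it is unknown, so this check is only illustrative.
    """
    clustered = sum(lab is not None for lab in labels)
    return clustered >= (1 - eps) * n_prime and cost <= factor * opt_cost


# Toy instance: two tight groups and one far-away point treated as an outlier.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (50, 50)]
labels = [0, 0, 0, 1, 1, None]
cost = min_sum_cost(points, labels)
```

On this toy instance five of the six points are clustered, so the coverage criterion holds for $n' = 6$ when $\varepsilon = 0.2$ but fails when $\varepsilon = 0.1$.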
