Approximation Schemes for Clustering with Outliers

Clustering problems are well studied in a variety of fields, such as data science, operations research, and computer science. Such problems include variants of center location problems, k-median, and k-means, to name a few. In some cases, not all data points need to be clustered; some may be discarded for various reasons. For instance, some points may arise from noise in a dataset, or one might be willing to discard a certain fraction of the points to avoid incurring unnecessary overhead in the cost of a clustering solution.

We study clustering problems with outliers. More specifically, we look at uncapacitated facility location (UFL), k-median, and k-means. In these problems, we are given a set X of data points in a metric space with distance function δ(·, ·), a set C of possible centers (each possibly with an opening cost), possibly an integer parameter k, and an additional parameter z bounding the number of outliers. In uncapacitated facility location with outliers, we have to open some centers, discard up to z points of X, and assign every other point to its nearest open center, minimizing the total assignment cost plus the center opening costs. In k-median and k-means, we have to open up to k centers, but there are no opening costs. In k-means, the cost of assigning j to i is δ²(j, i).

We present several results. Our main focus is on cases where δ is a doubling metric (this includes fixed-dimensional Euclidean metrics as a special case) or the shortest-path metric of a graph from a minor-closed family of graphs. For uniform-cost UFL with outliers on such metrics, we show that a simple multiswap local search heuristic yields a PTAS. With a bit more work, we extend this to bicriteria approximations for the k-median and k-means problems in the same metrics where, for any constant ε > 0, we can find a solution using (1 + ε)k centers whose cost is at most (1 + ε) times the optimum and that uses at most z outliers. Our algorithms are all based on natural multiswap local search heuristics.
We also show that natural local search heuristics that do not violate the number of clusters and outliers for k-median (or k-means) have an unbounded gap even in Euclidean metrics. Furthermore, we show how our analysis can be extended to general metrics for k-means with outliers to obtain a (25 + ε, 1 + ε)-approximation: an algorithm that uses at most (1 + ε)k clusters and at most z outliers, and whose cost is at most (25 + ε) times the optimum.
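
To make the objective concrete, the following is a minimal Python sketch, not the algorithm analyzed in the paper. It evaluates the assignment cost of a set of open centers after discarding the z costliest points, and runs a basic local search for uniform-cost UFL with outliers that adds, drops, or swaps a bounded number of centers per step. The names `cost_with_outliers`, `neighbors`, `local_search_ufl`, `open_cost`, and `max_swap` are illustrative assumptions, as are the starting solution and the first-improvement rule; the sketch carries none of the approximation guarantees stated above.

```python
from itertools import combinations


def cost_with_outliers(points, centers, z, dist, power=1):
    """Assignment cost after discarding the z costliest points.

    power=1 gives the k-median / UFL assignment cost,
    power=2 gives the k-means cost (squared distances).
    """
    if not centers:
        return float("inf")
    per_point = sorted(min(dist(p, c) for c in centers) ** power for p in points)
    keep = max(len(per_point) - z, 0)
    return sum(per_point[:keep])  # the z largest costs are treated as outliers


def neighbors(current, n_candidates, max_swap):
    """Yield index sets obtained by closing <= max_swap and opening <= max_swap centers."""
    closed = [i for i in range(n_candidates) if i not in current]
    for a in range(max_swap + 1):
        for b in range(max_swap + 1):
            if a + b == 0:
                continue
            for out in combinations(sorted(current), a):
                for inn in combinations(closed, b):
                    yield (current - frozenset(out)) | frozenset(inn)


def local_search_ufl(points, candidates, z, dist, open_cost, max_swap=1):
    """Illustrative local search for uniform-cost UFL with outliers.

    Assumption: start from the first candidate and take the first
    improving move; the paper's exact procedure may differ.
    """
    def total(idx):
        centers = [candidates[i] for i in idx]
        return open_cost * len(idx) + cost_with_outliers(points, centers, z, dist)

    current = frozenset([0])
    best = total(current)
    while True:
        better = next(
            (s for s in neighbors(current, len(candidates), max_swap)
             if total(s) < best - 1e-9),
            None,
        )
        if better is None:  # local optimum: no move within max_swap improves
            return [candidates[i] for i in sorted(current)], best
        current, best = better, total(better)


def line_dist(a, b):
    return abs(a - b)


points = [0, 1, 2, 10, 11, 12, 100]
candidates = [1, 11, 100]
solution, value = local_search_ufl(points, candidates, z=1, dist=line_dist, open_cost=3)
```

On this toy instance the point at 100 is far from both natural clusters, so with z = 1 it is cheaper to discard it than to pay the opening cost of a third center. Each iteration enumerates on the order of |C|^(2·max_swap) neighboring solutions, so the sketch is practical only for small swap sizes.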
