Paper Information - Approximation Schemes for Clustering with Outliers

Approximation Schemes for Clustering with Outliers

Clustering problems are well studied in a variety of fields, such as data science, operations research, and computer science. Such problems include variants of center location problems, k-median, and k-means, to name a few. In some cases, not all data points need to be clustered; some may be discarded for various reasons. For instance, some points may arise from noise in a dataset, or one might be willing to discard a certain fraction of the points to avoid incurring unnecessary overhead in the cost of a clustering solution. We study clustering problems with outliers. More specifically, we look at uncapacitated facility location (UFL), k-median, and k-means. In these problems, we are given a set X of data points in a metric space with distance function δ(·, ·), a set C of possible centers (each possibly with an opening cost), possibly an integer parameter k, and an additional parameter z bounding the number of outliers. In uncapacitated facility location with outliers, we have to open some centers, discard up to z points of X, and assign every other point to the nearest open center, minimizing the total assignment cost plus center opening costs. In k-median and k-means, we have to open up to k centers, but there are no opening costs. In k-means, the cost of assigning j to i is δ(j, i)². We present several results. Our main focus is on cases where δ is a doubling metric (this includes fixed-dimensional Euclidean metrics as a special case) or is the shortest-path metric of a graph from a minor-closed family of graphs. For uniform-cost UFL with outliers on such metrics, we show that a simple multiswap local search heuristic yields a PTAS. With a bit more work, we extend this to bicriteria approximations for the k-median and k-means problems in the same metrics: for any constant ε > 0, we can find a solution that uses (1 + ε)k centers, has cost at most (1 + ε) times the optimum, and discards at most z outliers. Our algorithms are all based on natural multiswap local search heuristics.
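For concreteness, the three objectives described in words above can be written as follows. The symbols S (the set of open centers), Z (the set of discarded points), and f_i (the opening cost of center i) are notation assumed here for illustration; only X, C, k, z, and δ come from the abstract.

```latex
% Assumed notation: S = open centers, Z = discarded points, f_i = opening cost of center i.
\begin{align*}
\delta(j, S) &= \min_{i \in S} \delta(j, i) \\[4pt]
\text{UFL with outliers:} \quad
  &\min_{\substack{S \subseteq C,\; Z \subseteq X \\ |Z| \le z}}
   \; \sum_{i \in S} f_i + \sum_{j \in X \setminus Z} \delta(j, S) \\[4pt]
\text{$k$-median with outliers:} \quad
  &\min_{\substack{S \subseteq C,\; |S| \le k \\ Z \subseteq X,\; |Z| \le z}}
   \; \sum_{j \in X \setminus Z} \delta(j, S) \\[4pt]
\text{$k$-means with outliers:} \quad
  &\min_{\substack{S \subseteq C,\; |S| \le k \\ Z \subseteq X,\; |Z| \le z}}
   \; \sum_{j \in X \setminus Z} \delta(j, S)^2
\end{align*}
```

For a fixed S, an optimal Z simply consists of the z points farthest from S, so the search is effectively over the choice of centers alone.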
We also show that natural local search heuristics that do not violate the number of clusters and outliers for k-median (or k-means) have an unbounded gap even in Euclidean metrics. Furthermore, we show how our analysis can be extended to general metrics for k-means with outliers to obtain a (25 + ε, 1 + ε)-approximation: an algorithm that uses at most (1 + ε)k clusters, has cost at most (25 + ε) times the optimum, and uses no more than z outliers.
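The kind of local search the abstract refers to can be illustrated with a small sketch for uniform-cost UFL with outliers. This is a simplified illustration, not the paper's procedure: it tries only single add, drop, and swap moves (the paper's heuristics are multiswap, exchanging several centers per step), it assumes points in Euclidean space, and it makes no claim about approximation quality. The function names, the `eps` improvement threshold, and the choice of starting solution are all assumptions made for this example.

```python
import math


def clustering_cost(points, centers, open_set, z, opening_cost, power=1):
    """Cost of opening `open_set` when the z farthest points are discarded.

    power=1 gives UFL / k-median style costs; power=2 gives k-means style costs.
    """
    if not open_set:
        return math.inf
    # Distance from every point to its nearest open center, smallest first.
    dists = sorted(
        min(math.dist(p, centers[i]) for i in open_set) ** power for p in points
    )
    # For a fixed set of centers, discarding the z farthest points is optimal.
    kept = dists[: max(len(dists) - z, 0)]
    return opening_cost * len(open_set) + sum(kept)


def neighbours(current, all_idx):
    """Solutions reachable by one add, one drop, or one swap (hypothetical move set)."""
    closed = sorted(all_idx - current)
    opened = sorted(current)
    for i in closed:                      # open one more center
        yield current | {i}
    if len(opened) > 1:
        for i in opened:                  # close one center
            yield current - {i}
    for i in opened:                      # swap one open center for a closed one
        for j in closed:
            yield (current - {i}) | {j}


def local_search(points, centers, z, opening_cost, eps=1e-9):
    """Repeat improving moves until none lowers the cost by more than eps."""
    all_idx = set(range(len(centers)))
    current = frozenset([0])              # arbitrary starting solution
    current_cost = clustering_cost(points, centers, current, z, opening_cost)
    improved = True
    while improved:
        improved = False
        for cand in neighbours(current, all_idx):
            c = clustering_cost(points, centers, cand, z, opening_cost)
            if c < current_cost - eps:
                current, current_cost = frozenset(cand), c
                improved = True
                break
    return current, current_cost


if __name__ == "__main__":
    # Two tight groups plus one far-away point that should be discarded.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (100, 100)]
    cands = [(0, 0), (10, 10), (100, 100), (5, 5)]
    print(local_search(pts, cands, z=1, opening_cost=2.0))
```

On the toy instance in the `__main__` block, the search opens the candidate centers at (0, 0) and (10, 10) and discards the point at (100, 100), rather than paying to open a center for it. Restricting the moves to keep the number of open centers fixed at k would give a k-median or k-means variant, which is the setting where the abstract says such heuristics can have an unbounded gap unless the number of centers or outliers is relaxed.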
