Approximate clustering without the approximation

Approximation algorithms for clustering points in metric spaces form a flourishing area of research, with much effort spent on understanding the approximation guarantees achievable for objective functions such as k-median, k-means, and min-sum clustering. This quest for better approximation algorithms is further fueled by the implicit hope that better approximations also yield more accurate clusterings. For example, for many problems such as clustering proteins by function, or clustering images by subject, there is some unknown correct "target" clustering, and the implicit hope is that approximately optimizing these objective functions will in fact produce a clustering that is close pointwise to the truth.

In this paper, we show that if we make this implicit assumption explicit, that is, if we assume that any c-approximation to the given clustering objective φ is ε-close to the target, then we can produce clusterings that are O(ε)-close to the target, even for values of c for which obtaining a c-approximation is NP-hard. In particular, for the k-median and k-means objectives we achieve this guarantee for any constant c > 1, and for the min-sum objective for any constant c > 2. Our results also highlight a surprising conceptual difference between assuming that the optimal solution to, say, the k-median objective is ε-close to the target, and assuming that any approximately optimal solution is ε-close to the target, even for an approximation factor as small as c = 1.01: in the former case, the problem of finding a solution that is O(ε)-close to the target remains computationally hard, whereas in the latter case we give an efficient algorithm.
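The two quantities the abstract relates can be made concrete: the objective cost of a clustering (here, k-median cost under a simple one-dimensional metric) and the pointwise distance between two clusterings, i.e., the fraction of points clustered differently under the best matching of cluster labels. The following is a minimal illustrative sketch, not the paper's algorithm; the function names and the brute-force label matching are our own choices, suitable only for small k.

```python
from itertools import permutations

def kmedian_cost(points, centers, assign):
    """k-median cost: sum of distances from each point to its assigned
    center (1-D points with the absolute-value metric, for illustration)."""
    return sum(abs(p - centers[assign[i]]) for i, p in enumerate(points))

def clustering_distance(labels_a, labels_b, k):
    """Pointwise distance between two clusterings: the fraction of points
    whose cluster differs, minimized over all matchings of the k cluster
    labels (brute force over permutations, so only practical for small k)."""
    n = len(labels_a)
    best = n
    for perm in permutations(range(k)):
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if perm[a] != b)
        best = min(best, mismatches)
    return best / n

# A clustering is "epsilon-close to the target" when clustering_distance
# between it and the target labeling is at most epsilon.
```

Under this notion, relabeling the clusters does not change the distance; for instance, `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are at distance 0, while moving one of four points across clusters gives distance 0.25.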
