Clustering under approximation stability

A common approach to clustering data is to view the data objects as points in a metric space and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution for the chosen objective will closely match the desired “target” clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including the three above, are NP-hard to optimize, so this assumption by itself is not sufficient, assuming P ≠ NP, to obtain low-error clusterings via polynomial-time algorithms. In this article, we show that this barrier can be bypassed if we slightly strengthen the assumption: we ask that for some small constant c, not only the optimal solution but every c-approximation to the optimum differs from the target on at most an ε fraction of points. We call this property (c,ε)-approximation-stability. We show that under this condition it is possible to efficiently obtain low-error clusterings even when the property holds only for values of c for which the objective is NP-hard to approximate. Specifically, for any constant c > 1, (c,ε)-approximation-stability of the k-median or k-means objective can be used to efficiently produce a clustering with error O(ε) with respect to the target clustering, as can stability of the min-sum objective if the target clusters are sufficiently large. Thus, we can perform nearly as well, in terms of agreement with the target clustering, as if we could approximate these objectives to within this NP-hard factor.
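
For concreteness, the objectives and the stability notion above can be stated formally. The following is a minimal sketch using standard definitions from the clustering literature; the symbols (Φ, dist, σ, n) are notation introduced here for illustration and do not appear in the text above. Given a set S of n points in a metric space (S, d) and a k-clustering C = {C_1, ..., C_k} with centers c_1, ..., c_k, the three objectives are

\[
\Phi_{\text{median}}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i), \qquad
\Phi_{\text{means}}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, c_i)^2, \qquad
\Phi_{\text{min-sum}}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x, y \in C_i} d(x, y).
\]

The error of a clustering C with respect to the target clustering C* is the smallest fraction of points that must be reassigned to turn C into C*, minimized over matchings σ of cluster indices:

\[
\mathrm{dist}(\mathcal{C}, \mathcal{C}^{*}) \;=\; \min_{\sigma \in S_k} \; \frac{1}{n} \sum_{i=1}^{k} \bigl| C_i \setminus C^{*}_{\sigma(i)} \bigr| .
\]

An instance then satisfies (c, ε)-approximation-stability for an objective Φ if every c-approximation to the optimum is ε-close to the target:

\[
\Phi(\mathcal{C}) \;\le\; c \cdot \Phi(\mathcal{C}^{\mathrm{OPT}})
\;\;\Longrightarrow\;\;
\mathrm{dist}(\mathcal{C}, \mathcal{C}^{*}) \;\le\; \varepsilon .
\]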
