Clustering via Similarity Functions: Theoretical Foundations and Algorithms

Problems of clustering data from pairwise similarity information arise in many different fields. Yet the questions of which algorithm is best to use under what conditions, and how good a notion of similarity one needs in order to cluster accurately, remain poorly understood. In this work we propose a new general framework for analyzing clustering from similarity information that directly addresses the question of which properties of a similarity measure are sufficient to cluster accurately, and by what kinds of algorithms. We show that in our framework a wide variety of interesting learning-theoretic and game-theoretic properties, including properties motivated by mathematical biology, can be used to cluster well, and we design new efficient algorithms that are able to take advantage of them. We consider two natural clustering objectives: (a) list clustering, where the algorithm’s goal is to produce a small list of clusterings such that at least one of them is approximately correct, and (b) hierarchical clustering, where the algorithm’s goal is to produce a hierarchy such that the desired clustering is some pruning of this tree (which a user could navigate). We develop a notion of the clustering complexity of a given property, analogous to notions of capacity in learning theory, that characterizes its information-theoretic usefulness for clustering. We analyze this quantity for a wide range of properties, giving tight upper and lower bounds. We also show how our algorithms can be extended to the inductive case, i.e., clustering from just a constant-sized sample, as in property testing. While our algorithms for this setting remain very efficient, proving their correctness requires subtle analysis based on regularity-type results.
Our framework can be viewed as an analog for clustering of discriminative models for supervised classification (i.e., the Statistical Learning Theory framework and the PAC learning model), where our goal is to cluster accurately given a property or relation the similarity function is believed to satisfy with respect to the ground-truth clustering. More specifically, our framework is analogous to that of data-dependent concept classes in supervised learning, where conditions such as the large-margin property have been central in the analysis of kernel methods.
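To make the hierarchical-clustering objective concrete, the following is a minimal sketch (not code from the paper; all names are illustrative) of single-linkage agglomerative clustering on a pairwise similarity matrix. Under a strict-separation-style property, where every point is more similar to the points in its own ground-truth cluster than to any point outside it, each ground-truth cluster shows up as a subtree of the resulting hierarchy, so the target clustering is one pruning of the tree.

```python
# Hedged sketch: single-linkage hierarchical clustering from similarities.
# Assumes a symmetric similarity matrix; "single_linkage" is our own helper
# name, not an identifier from the paper or any library.

def single_linkage(sim):
    """Repeatedly merge the two most-similar clusters; return the merge history."""
    n = len(sim)
    clusters = [frozenset([i]) for i in range(n)]
    history = [set(clusters)]  # each entry is one level of the hierarchy
    while len(clusters) > 1:
        # Single linkage: cluster similarity = max pairwise point similarity.
        a, b = max(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: max(sim[p][q] for p in pair[0] for q in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(set(clusters))
    return history

# Toy similarity matrix: {0,1} and {2,3} are two well-separated clusters
# (within-cluster similarity exceeds every between-cluster similarity).
S = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.1, 0.2],
    [0.2, 0.1, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

history = single_linkage(S)
# The ground-truth clustering appears as one pruning (level) of the tree.
assert {frozenset({0, 1}), frozenset({2, 3})} in history
```

The point of the sketch is only that, when the similarity function satisfies such a separation property with respect to the ground truth, a user navigating the tree can recover the desired clustering by choosing the right pruning; the paper's algorithms handle far weaker properties than this.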
