A Theory of Similarity Functions for Clustering

Problems of clustering data from pairwise similarity information are ubiquitous in computer science. Theoretical treatments typically view the similarity information as ground truth and then design algorithms to (approximately) optimize various graph-based objective functions. In most applications, however, the similarity information is merely based on some heuristic: the true goal is to cluster the points correctly rather than to optimize any specific graph property. In this work, we initiate a theoretical study of the design of similarity functions for clustering from this perspective. In particular, motivated by recent work in learning theory that asks “what natural properties of a similarity function are sufficient to be able to learn well?”, we ask “what natural properties of a similarity function are sufficient to be able to cluster well?” We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory) that characterizes its information-theoretic usefulness for clustering. We then analyze this complexity for several natural game-theoretic and learning-theoretic properties, and design efficient algorithms that are able to take advantage of them. We consider two natural relaxations of the clustering objective: (a) list clustering, where, analogous to list decoding, the algorithm may produce a small list of clusterings from which a user can select, and (b) hierarchical clustering, where the algorithm produces a tree such that the desired clustering is some pruning of it, which a user can navigate. Our algorithms for hierarchical clustering combine recent learning-theoretic approaches with linkage-style methods. We also show how our algorithms can be extended to the inductive setting, i.e., using just a constant-sized sample, as in property testing. The analysis here uses regularity-type results of [18] and [4].
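To make the linkage-style approach concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: a bottom-up single-linkage procedure that builds a merge tree directly from a pairwise similarity function. The function name hierarchical_clustering, the single-linkage merge rule, and the user-supplied sim are assumptions chosen for exposition; any pruning of the returned tree is a candidate clustering in the sense described above.

    # Minimal illustrative sketch (Python), not the paper's algorithm: bottom-up
    # single-linkage clustering driven only by a pairwise similarity function.
    # `sim` is an assumed user-supplied heuristic similarity, as in the abstract.
    def hierarchical_clustering(points, sim):
        """Greedily merge the two most similar clusters until one remains.

        Returns the merge tree as nested tuples; each pruning of the tree
        (a set of subtrees covering all points) is a candidate clustering.
        """
        # Each cluster is a pair (tree, members); start from singletons.
        clusters = [(p, [p]) for p in points]
        while len(clusters) > 1:
            best = None  # (score, i, j) for the most similar pair of clusters
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    # Single-linkage score: best pairwise similarity across clusters.
                    score = max(sim(a, b) for a in clusters[i][1]
                                          for b in clusters[j][1])
                    if best is None or score > best[0]:
                        best = (score, i, j)
            _, i, j = best
            merged = ((clusters[i][0], clusters[j][0]),
                      clusters[i][1] + clusters[j][1])
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
        return clusters[0][0]

    # Example: points on a line, with similarity = negative distance.
    tree = hierarchical_clustering([0.0, 0.1, 0.2, 5.0, 5.1],
                                   sim=lambda a, b: -abs(a - b))
    print(tree)  # nested-tuple merge tree: ((0.2, (0.0, 0.1)), (5.0, 5.1))

Pruning this tree at its root's two children recovers the two intuitive clusters; deeper prunings give finer candidate clusterings for a user to navigate.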

[1] Sanjeev Arora, et al. Learning mixtures of arbitrary Gaussians, 2001, STOC '01.

[2] Frank McSherry. Spectral partitioning of random graphs, 2001, FOCS '01.

[3] Dimitris Achlioptas, et al. On Spectral Learning of Mixtures of Distributions, 2005, COLT.

[4] Marina Meila, et al. Comparing clusterings: an axiomatic view, 2005, ICML.

[5] Noga Alon, et al. Random sampling and approximation of MAX-CSPs, 2003, J. Comput. Syst. Sci.

[6] Vijay V. Vazirani, et al. Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation, 2001, JACM.

[7] Moses Charikar, et al. Aggregating inconsistent information: ranking and clustering, 2005, STOC '05.

[8] John Shawe-Taylor, et al. Structural Risk Minimization Over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[9] Maria-Florina Balcan, et al. On a theory of learning with similarity functions, 2006, ICML.

[10] Thorsten Joachims. Learning to classify text using support vector machines: methods, theory and algorithms, 2002, The Kluwer International Series in Engineering and Computer Science.

[11] Marina Meila. Comparing Clusterings by the Variation of Information, 2003, COLT.

[12] Santosh S. Vempala, et al. A spectral algorithm for learning mixture models, 2004, J. Comput. Syst. Sci.

[13] Vladimir Vapnik. Statistical Learning Theory, 1998.

[14] Santosh S. Vempala, et al. The Spectral Method for General Mixture Models, 2005, COLT.

[15] Dana Ron, et al. Property testing and its connection to learning and approximation, 1998, JACM.

[16] László Györfi, et al. A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.

[17] Leslie G. Valiant. A theory of the learnable, 1984, CACM.

[18] Alan M. Frieze, et al. Quick Approximation to Matrices and Applications, 1999, Combinatorica.

[19] Alexander Rakhlin, et al. Stability of K-Means Clustering, 2006, NIPS.

[20] Jon M. Kleinberg. An Impossibility Theorem for Clustering, 2002, NIPS.

[21] Ralf Herbrich. Learning Kernel Classifiers, 2001.

[22] Anirban Dasgupta, et al. Spectral Clustering by Recursive Partitioning, 2006, ESA.

[23] Shai Ben-David, et al. A Sober Look at Clustering Stability, 2006, COLT.

[24] Sanjoy Dasgupta. Learning mixtures of Gaussians, 1999, FOCS '99.

[25] Nick Littlestone. From on-line to batch learning, 1989, COLT '89.

[26] Sudipto Guha, et al. Improved combinatorial algorithms for the facility location and k-median problems, 1999, FOCS '99.

[27] Santosh S. Vempala, et al. On Kernels, Margins, and Low-Dimensional Mappings, 2004, ALT.

[28] Bernhard Schölkopf, et al. Kernel Methods in Computational Biology, 2005.

[29] Sudipto Guha, et al. A constant-factor approximation algorithm for the k-median problem (extended abstract), 1999, STOC '99.

[30] Venkatesan Guruswami, et al. Clustering with qualitative information, 2003, FOCS '03.

[31] Chaitanya Swamy. Correlation Clustering: maximizing agreements via semidefinite programming, 2004, SODA '04.

[32] Noga Alon, et al. A Spectral Technique for Coloring Random 3-Colorable Graphs, 1994.

[33] Piotr Indyk. Sublinear time algorithms for metric space problems, 1999, STOC '99.

[34] Donald E. Knuth. The Art of Computer Programming.

[35] Olivier Bousquet, et al. Theory of Classification: A Survey of Some Recent Advances, 2004.

[36] Santosh S. Vempala, et al. On clusterings: good, bad and spectral, 2000, FOCS '00.

[37] Tong Zhang. Regularized Winnow Methods, 2000, NIPS.