Clustering by Passing Messages Between Data Points

Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

[1]  Robert G. Gallager,et al.  Low-density parity-check codes , 1962, IRE Trans. Inf. Theory.

[2]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[3]  K. Kojima Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. , 1969 .

[4]  J J Hopfield,et al.  Neural networks and physical systems with emergent collective computational abilities. , 1982, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems , 1988 .

[6]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[7]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[8]  Alain Glavieux,et al.  Reflections on the Prize Paper : "Near optimum error-correcting coding and decoding: turbo codes" , 1998 .

[9]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10]  David J. C. MacKay,et al.  Good Error-Correcting Codes Based on Very Sparse Matrices , 1997, IEEE Trans. Inf. Theory.

[11]  Brendan J. Frey,et al.  Factor graphs and the sum-product algorithm , 2001, IEEE Trans. Inf. Theory.

[12]  D. Wilkin,et al.  Neuron , 2001, Brain Research.

[13]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[14]  Sudipto Guha,et al.  A constant-factor approximation algorithm for the k-median problem (extended abstract) , 1999, STOC '99.

[15]  M. Mézard,et al.  Analytic and Algorithmic Solution of Random Satisfiability Problems , 2002, Science.

[16]  Tatiana A. Tatusova,et al.  NCBI Reference Sequence Project: update and current status , 2003, Nucleic Acids Res..

[17]  M. Mézard Passing Messages Between Disciplines , 2003, Science.

[18]  Radford M. Neal,et al.  A Split-Merge Markov chain Monte Carlo Procedure for the Dirichlet Process Mixture Model , 2004 .

[19]  宁北芳,et al.  疟原虫var基因转换速率变化导致抗原变异[英]/Paul H, Robert P, Christodoulou Z, et al//Proc Natl Acad Sci U S A , 2005 .

[20]  William T. Freeman,et al.  Constructing free-energy approximations and generalized belief propagation algorithms , 2005, IEEE Transactions on Information Theory.

[21]  B. Frey,et al.  Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs , 2005, Nature Genetics.

[22]  R. Rosenfeld Nature , 2009, Otolaryngology--head and neck surgery : official journal of American Academy of Otolaryngology-Head and Neck Surgery.