Hierarchical Affinity Propagation

Affinity propagation is an exemplar-based clustering algorithm that finds a set of data points that best exemplify the data and associates each data point with one exemplar. We extend affinity propagation in a principled way to solve the hierarchical clustering problem, which arises in a variety of domains including biology, sensor networks, and decision making in operational research. We derive an inference algorithm that operates by propagating information up and down the hierarchy and is efficient despite the high-order potentials required for the graphical model formulation. We demonstrate that our method outperforms greedy techniques that cluster one layer at a time. On an artificial dataset designed to mimic HIV-strain mutation dynamics, our method outperforms related methods. For real HIV sequences, where the ground truth is not available, our method achieves better values of the underlying objective function, and the resulting clusters correspond meaningfully to geographical location and strain subtypes. Finally, we report results on using the method for the analysis of mass spectra, showing that it performs favorably compared to state-of-the-art methods.
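
For readers unfamiliar with the base algorithm, the following is a minimal sketch of standard, single-layer affinity propagation using the usual responsibility and availability message updates. It is not the hierarchical method described above, which additionally passes messages between layers. The function name, the damping value, the iteration count, the preference value, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def affinity_propagation(S, preference, damping=0.5, max_iter=200):
    """Flat affinity propagation on a similarity matrix S (illustrative sketch).

    S[i, k] is the similarity of point i to candidate exemplar k.
    `preference` is written onto the diagonal and controls how readily
    points become exemplars. All parameter defaults are assumptions.
    """
    S = np.array(S, dtype=float)
    n = S.shape[0]
    idx = np.arange(n)
    S[idx, idx] = preference
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)

    for _ in range(max_iter):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        first = AS.argmax(axis=1)
        first_val = AS[idx, first]
        AS[idx, first] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - first_val[:, None]
        R_new[idx, first] = S[idx, first] - second_val
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[idx, idx].copy()
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new

    # A point is an exemplar when its self-evidence a(k,k) + r(k,k) is positive.
    exemplars = np.flatnonzero(np.diag(A + R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = S[:, exemplars].argmax(axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels

# Toy data (hypothetical): two well-separated groups on a line,
# with negative squared distance as the similarity.
X = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S = -(X[:, None] - X[None, :]) ** 2
exemplars, labels = affinity_propagation(S, preference=-50.0)
print(exemplars, labels)
```

A greedy hierarchical baseline of the kind the abstract compares against could be approximated by rerunning such a routine on the exemplars of each layer; the method summarized above instead infers all layers jointly.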
