Clustering by Soft-constraint Affinity Propagation: Applications to Gene-expression Data

MOTIVATION: Similarity-measure-based clustering is a crucial problem appearing throughout scientific data analysis. Recently, a powerful new algorithm called Affinity Propagation (AP), based on message-passing techniques, was proposed by Frey and Dueck (2007a). In AP, each cluster is identified by a common exemplar to which all other data points of the same cluster refer, and exemplars have to refer to themselves. Despite its proven power, AP in its present form suffers from a number of drawbacks. The hard constraint of having exactly one exemplar per cluster restricts AP to classes of regularly shaped clusters and leads to suboptimal performance, e.g. in analyzing gene expression data.

RESULTS: This limitation can be overcome by relaxing the AP hard constraints. A new parameter controls the importance of the constraints relative to the aim of maximizing the overall similarity, and allows one to interpolate between the simple case where each data point selects its closest neighbor as an exemplar and the original AP. The resulting soft-constraint affinity propagation (SCAP) is more informative and accurate, and leads to more stable clustering. Even though a new a priori free parameter is introduced, the overall dependence of the algorithm on external tuning is reduced, as robustness is increased and an optimal strategy for parameter selection emerges more naturally. SCAP is tested on biological benchmark data, including in particular microarray data related to various cancer types. We show that the algorithm efficiently unveils the hierarchical cluster structure present in the data sets. Furthermore, it allows the extraction of sparse gene expression signatures for each cluster.
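The interpolation described above can be illustrated with a toy sketch. It is not the message-passing algorithm of the paper: it brute-forces a tiny data set and assumes a simple objective in which each exemplar that is chosen by some point but does not choose itself costs a finite penalty. The names `soft_score`, `best_assignment`, and `penalty`, the squared-distance similarity, and the self-similarity value of -20 are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np

def soft_score(S, c, penalty):
    """Total similarity of assignment c, minus a finite penalty for each
    chosen exemplar that does not refer to itself (assumed objective)."""
    n = len(c)
    total = sum(S[i, c[i]] for i in range(n))
    # Exemplars picked by some point i that do not pick themselves.
    violated = {c[i] for i in range(n) if c[c[i]] != c[i]}
    return total - penalty * len(violated)

def best_assignment(S, penalty):
    """Exhaustive search over all exemplar choices; only viable for tiny n."""
    n = S.shape[0]
    return max(itertools.product(range(n), repeat=n),
               key=lambda c: soft_score(S, c, penalty))

# Five points on a line; similarity is the negative squared distance.
x = np.array([0.0, 1.0, 2.5, 10.0, 11.0])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, -20.0)  # assumed cost of a point being its own exemplar

# penalty = 0: constraints are ignored, every point picks its nearest neighbor.
nearest = best_assignment(S, 0.0)      # (1, 0, 1, 4, 3)

# Large penalty: constraints are effectively hard, as in the original AP.
strict = best_assignment(S, 1000.0)
```

With the penalty at zero the result is the nearest-neighbor assignment; with a very large penalty every chosen exemplar refers to itself, which recovers the hard-constraint regime. Intermediate values trade constraint violations against total similarity.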

[1] Anil K. Jain, et al. Data clustering: a review, 1999, CSUR.

[2] Riccardo Zecchina, et al. Learning by message-passing in networks of discrete synapses, 2005, Physical review letters.

[3] Ash A. Alizadeh, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, 2000, Nature.

[4] Claudia Mauri, et al. Therapeutic activity of agonistic monoclonal antibodies against CD40 in a chronic autoimmune inflammatory process, 2000, Nature Medicine.

[5] Jitendra Malik, et al. Normalized cuts and image segmentation, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6] Y. Kabashima. A CDMA multiuser detection algorithm on the basis of belief propagation, 2003.

[7] Tomer Hertz, et al. Pairwise Clustering and Graphical Models, 2003, NIPS.

[8] Eytan Domany, et al. Superparamagnetic Clustering of Data, 1996.

[9] H. L. Le Roy, et al. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Vol. IV, 1969.

[10] Brendan J. Frey, et al. Mixture Modeling by Affinity Propagation, 2005, NIPS.

[11] J. MacQueen. Some methods for classification and analysis of multivariate observations, 1967.

[12] M. Ringnér, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, 2001, Nature Medicine.

[13] Richard O. Duda, et al. Pattern classification and scene analysis, 1974, A Wiley-Interscience publication.

[14] Delbert Dueck, et al. Clustering by Passing Messages Between Data Points, 2007, Science.

[15] William T. Freeman, et al. Constructing free-energy approximations and generalized belief propagation algorithms, 2005, IEEE Transactions on Information Theory.

[16] Marcel Dettling, et al. BagBoosting for tumor classification with gene expression data, 2004, Bioinform.

[17] Brendan J. Frey, et al. Factor graphs and the sum-product algorithm, 2001, IEEE Trans. Inf. Theory.

[18] T. Poggio, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression, 2002, Nature.

[19] J. Mesirov, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, 1999, Science.

[20] M. Mézard, et al. Analytic and Algorithmic Solution of Random Satisfiability Problems, 2002, Science.