A Nonparametric Bayesian Approach to Modeling Overlapping Clusters

Although clustering data into mutually exclusive partitions has been an extremely successful approach to unsupervised learning, there are many situations in which a richer model is needed to fully represent the data. This is the case in problems where data points simultaneously belong to multiple, overlapping clusters. For example, a particular gene may have several functions and therefore belong to several distinct clusters of genes, and a biologist may want to discover these through unsupervised modeling of gene expression data. We present a new nonparametric Bayesian method, the Infinite Overlapping Mixture Model (IOMM), for modeling overlapping clusters. The IOMM uses exponential family distributions to model each cluster and forms an overlapping mixture by taking products of such distributions, much like products of experts (Hinton, 2002). The IOMM allows an unbounded number of clusters, and the assignment of points to (multiple) clusters is modeled using an Indian Buffet Process (IBP) (Griffiths and Ghahramani, 2006). The IOMM has the desirable property of being able to focus on overlapping regions while maintaining the ability to model a potentially infinite number of clusters which may overlap. We derive MCMC inference algorithms for the IOMM and show that these can be used to cluster movies into multiple genres.

∗ZG is also an Associate Research Professor in the Machine Learning Department at Carnegie Mellon University.
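The two ingredients named in the abstract, an IBP prior over binary cluster assignments and a product of per-cluster exponential family densities, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes spherical Gaussian clusters (the abstract only says "exponential family"), the standard sequential IBP construction with a concentration parameter here called `alpha`, and hypothetical function names `sample_ibp` and `product_of_gaussians`. It covers only the generative pieces, not the MCMC inference.

```python
import numpy as np


def sample_ibp(num_points, alpha, rng):
    """Draw a binary assignment matrix Z (points x clusters) from an IBP prior.

    Sequential construction: point i (0-indexed) joins an existing cluster k
    with probability m_k / (i + 1), where m_k counts earlier members, and then
    opens Poisson(alpha / (i + 1)) new clusters of its own.
    """
    Z = np.zeros((num_points, 0), dtype=int)
    for i in range(num_points):
        if Z.shape[1] > 0:
            m = Z[:i].sum(axis=0)  # membership counts among earlier points
            Z[i] = rng.random(Z.shape[1]) < m / (i + 1)
        k_new = rng.poisson(alpha / (i + 1))
        if k_new > 0:
            new_cols = np.zeros((num_points, k_new), dtype=int)
            new_cols[i] = 1  # the point that opens a cluster belongs to it
            Z = np.hstack([Z, new_cols])
    return Z


def product_of_gaussians(means, variances, z):
    """Mean and variance of the renormalized product of the active clusters.

    Assumption: each cluster k is a spherical Gaussian N(means[k], variances[k] * I).
    Multiplying Gaussian densities adds their precisions, and the resulting
    mean is the precision-weighted average of the active cluster means.
    """
    active = np.asarray(z, dtype=bool)
    if not active.any():
        # How a point with no cluster is handled is not stated in the abstract.
        raise ValueError("point must belong to at least one cluster")
    precisions = 1.0 / variances[active]
    var = 1.0 / precisions.sum()
    mean = var * (precisions[:, None] * means[active]).sum(axis=0)
    return mean, var


rng = np.random.default_rng(0)
Z = sample_ibp(num_points=10, alpha=2.0, rng=rng)

means = np.array([[0.0, 0.0], [2.0, 2.0]])
variances = np.array([1.0, 1.0])
overlap_mean, overlap_var = product_of_gaussians(means, variances, z=[1, 1])
```

With two unit-variance clusters, a point assigned to both is modeled by a tighter Gaussian centred between them, which is the sense in which a product of distributions concentrates on the overlapping region rather than spreading mass over both clusters as an additive mixture would.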

[1] David G. Stork, et al. Pattern Classification, 1973.

[2] Geoffrey E. Hinton, et al. Autoencoders, Minimum Description Length and Helmholtz Free Energy, 1993, NIPS.

[3] Eric Saund, et al. Unsupervised Learning of Mixtures of Multiple Causes in Binary Data, 1993, NIPS.

[4] Zoubin Ghahramani, et al. Factorial Learning and the EM Algorithm, 1994, NIPS.

[5] Eric Saund, et al. Applying the Multiple Cause Mixture Model to Text Categorization, 1996, ICML.

[6] Jianbo Shi, et al. Learning Segmentation by Random Walks, 2000, NIPS.

[7] M. Escobar, et al. Markov Chain Sampling Methods for Dirichlet Process Mixture Models, 2000.

[8] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[9] Daphne Koller, et al. Decomposing Gene Expression into Cellular Processes, 2002, Pacific Symposium on Biocomputing.

[10] Milos Hauskrecht, et al. Modeling Cellular Processes with Variational Bayesian Cooperative Vector Quantizer, 2003, Pacific Symposium on Biocomputing.

[11] Daphne Koller, et al. Probabilistic discovery of overlapping cellular processes and their regulation, 2004, J. Comput. Biol.

[12] Joydeep Ghosh, et al. Model-based overlapping clustering, 2005, KDD '05.

[13] Thomas L. Griffiths, et al. Infinite latent feature models and the Indian buffet process, 2005, NIPS.

[14] Carl E. Rasmussen, et al. A choice model with infinitely many latent features, 2006, ICML.

[15] Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.

[16] B. Schölkopf, et al. Modeling Dyadic Data with Binary Latent Factors, 2007.