A unified probabilistic framework for clustering genes from gene expression and protein-protein interaction data

This paper presents a novel mixture model for clustering genes based on Gaussian and Bernoulli distributed data. One typical application is clustering genes using gene expression and protein-protein interaction (PPI) data. The underlying assumption is that genes within a cluster share more similar expression profiles and have, on average, more PPIs with a set of genes than genes from different clusters do. The proposed mixture model, GBMM, differs from its component models in that it integrates different data types into a single, unified probabilistic modeling framework. Moreover, the model can be extended to other parametric distributions and can therefore incorporate further information in a coherent manner. We developed an expectation-maximization (EM) algorithm for GBMM and compared the performance of four well-known approximation-based model selection criteria under different scenarios. The results show that combining expression and PPI data can greatly improve clustering accuracy compared with analyzing either data source alone, and that the more PPIs are known for a given set of genes, the larger the improvement.
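
To make the idea of a joint Gaussian and Bernoulli mixture concrete, the following is a minimal sketch of EM for such a model. It is not the paper's implementation: the function name `fit_gbmm`, the diagonal Gaussian covariance, the treatment of each PPI column as an independent Bernoulli variable, the conditional independence of the two data types given the cluster, and the random soft initialization are all assumptions made for illustration. Here `X` is a genes-by-conditions expression matrix and `B` is a genes-by-partners 0/1 interaction matrix.

```python
import numpy as np


def fit_gbmm(X, B, k, n_iter=100, seed=0, eps=1e-6):
    """Illustrative EM for a mixture of (diagonal Gaussian x independent Bernoulli).

    X: (n, d) real-valued expression profiles.
    B: (n, m) binary interaction indicators.
    Returns hard cluster labels and the final log-likelihood.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumed initialization: random soft responsibilities, one row per gene.
    R = rng.dirichlet(np.ones(k), size=n)
    log_norm = np.zeros((n, 1))

    for _ in range(n_iter):
        # M-step: responsibility-weighted estimates of all parameters.
        Nk = R.sum(axis=0) + eps                      # effective cluster sizes
        pi = Nk / Nk.sum()                            # mixing proportions
        mu = (R.T @ X) / Nk[:, None]                  # Gaussian means, (k, d)
        var = (R.T @ X**2) / Nk[:, None] - mu**2      # per-dimension variances
        var = np.maximum(var, eps)                    # guard against collapse
        theta = (R.T @ B) / Nk[:, None]               # Bernoulli rates, (k, m)
        theta = np.clip(theta, eps, 1.0 - eps)        # keep logs finite

        # E-step: per-cluster log joint density of both data types.
        diff2 = (X[:, None, :] - mu[None, :, :]) ** 2
        log_gauss = -0.5 * (diff2 / var[None] + np.log(2 * np.pi * var[None])).sum(axis=2)
        log_bern = B @ np.log(theta).T + (1.0 - B) @ np.log1p(-theta).T
        log_p = np.log(pi)[None, :] + log_gauss + log_bern

        # Normalize in log space for numerical stability.
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        R = np.exp(log_p - log_norm)

    return R.argmax(axis=1), float(log_norm.sum())
```

Because the two log-density terms are simply added, a gene's cluster assignment is driven by both its expression profile and its interaction pattern. Dropping either term recovers a plain Gaussian or Bernoulli mixture, which is one way to read the comparison against single-source clustering described above.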