A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data

BackgroundCluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of different approaches have been proposed for that purpose, out of which different mixture models provide a principled probabilistic framework. Cluster analysis is increasingly often supplemented with multiple data sources nowadays, and these heterogeneous information sources should be made as efficient use of as possible.ResultsThis paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). The proposed BGMM method differs from other mixture model based methods in its integration of two different data types into a single and unified probabilistic modeling framework, which provides a more efficient use of multiple data sources than methods that analyze different data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework since many data sources can be modeled as Gaussian or beta distributed random variables, and it can also be extended to integrate data that have other parametric distributions as well, which adds even more flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM, the standard expectation maximization (EM) algorithm, an approximated EM and a hybrid EM, and propose to tackle the model selection problem by well-known model selection criteria, for which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).ConclusionPerformance tests with simulated data show that combining two different data sources into a single mixture joint model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distribution) and protein-DNA binding probabilities (modeled as beta distribution) also demonstrate that BGMM can yield more biologically reasonable results compared with either of its two extreme cases. One of our applications has found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might be useful to better understand the TLR-3/4 signal transduction.

[1]  Brad T. Sherman,et al.  DAVID: Database for Annotation, Visualization, and Integrated Discovery , 2003, Genome Biology.

[2]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[3]  Yuka Okada,et al.  Expression of AP-1 (c-fos/c-jun) in developing mouse corneal epithelium , 2003, Graefe's Archive for Clinical and Experimental Ophthalmology.

[4]  Padhraic Smyth,et al.  Model selection for probabilistic clustering using cross-validated likelihood , 2000, Stat. Comput..

[5]  Masayuki Yamamoto,et al.  A Combinatorial Code for Gene Expression Generated by Transcription Factor Bach2 and MAZR (MAZ-Related Factor) through the BTB/POZ Domain , 2000, Molecular and Cellular Biology.

[6]  Yuan Ji,et al.  Applications of beta-mixture models in bioinformatics , 2005, Bioinform..

[7]  Adrian E. Raftery,et al.  Model-Based Clustering, Discriminant Analysis, and Density Estimation , 2002 .

[8]  G. Govaert,et al.  Choosing models in model-based clustering and discriminant analysis , 1999 .

[9]  Xiaoman Li,et al.  The induction of macrophage gene expression by LPS predominantly utilizes Myd88-independent signaling cascades. , 2004, Physiological genomics.

[10]  H. Akaike A new look at the statistical model identification , 1974 .

[11]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[12]  K. Itoh,et al.  Bach proteins belong to a novel family of BTB-basic leucine zipper transcription factors that interact with MafK and regulate transcription through the NF-E2 site , 1996, Molecular and cellular biology.

[13]  Bin Li,et al.  Uncovering a Macrophage Transcriptional Program by Integrating Evidence from Motif Scanning and Expression Dynamics , 2008, PLoS Comput. Biol..

[14]  Nicola J. Rinaldi,et al.  Transcriptional Regulatory Networks in Saccharomyces cerevisiae , 2002, Science.

[15]  Geoffrey J. McLachlan,et al.  Finite Mixture Models , 2019, Annual Review of Statistics and Its Application.

[16]  Aidong Zhang,et al.  Cluster analysis for gene expression data: a survey , 2004, IEEE Transactions on Knowledge and Data Engineering.

[17]  H. Lähdesmäki,et al.  Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources , 2008, PloS one.

[18]  Wei Pan,et al.  Bioinformatics Original Paper Incorporating Gene Functions as Priors in Model-based Clustering of Microarray Gene Expression Data , 2022 .

[19]  H. Bozdogan Model selection and Akaike's Information Criterion (AIC): The general theory and its analytical extensions , 1987 .

[20]  A. Bowie,et al.  The Toll-IL-1 receptor adaptor family grows to five members. , 2003, Trends in immunology.