A mixture model with random-effects components for clustering correlated gene-expression profiles

MOTIVATION: The clustering of gene profiles across experimental conditions of interest contributes significantly to the elucidation of unknown gene function, the validation of gene discoveries and the interpretation of biological processes. However, this clustering problem is not straightforward, as the profiles of the genes are not all independently distributed and the expression levels may have been obtained from an experimental design involving replicated arrays. Ignoring the dependence between the gene profiles and the structure of the replicated data can cause important sources of variability in the experiments to be overlooked in the analysis, with the consequent possibility of misleading inferences. We propose a random-effects model that provides a unified approach to the clustering of genes with correlated expression levels measured in a wide variety of experimental situations. Our model extends the normal mixture model to account for the correlations between the gene profiles and to enable covariate information to be incorporated into the clustering process. Hence the model is applicable to longitudinal studies with or without replication (for example, time-course experiments, by using time as a covariate) and to cross-sectional experiments, by using categorical covariates to represent the different experimental classes.

RESULTS: We show that our random-effects model can be fitted by maximum likelihood via the EM algorithm, for which the E (expectation) and M (maximization) steps can be implemented in closed form. Hence our model can be fitted deterministically, without the need for time-consuming Monte Carlo approximations. The effectiveness of our model-based procedure for the clustering of correlated gene profiles is demonstrated on three real datasets, representing typical microarray experimental designs and covering time-course, repeated-measurement and cross-sectional data. In these examples, relevant clusters of the genes are obtained, which are supported by existing gene-function annotation. A synthetic dataset is also considered.

AVAILABILITY: A Fortran program called EMMIX-WIRE (EM-based MIXture analysis WIth Random Effects) is available on request from the corresponding author.
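
As a rough illustration of what closed-form E and M steps look like, the following Python sketch fits a much simpler model than the one proposed here: a normal mixture in which each component mean is a linear function of covariates (for example, time), with independent errors and no random effects. It is not EMMIX-WIRE and does not reproduce its model; the function names, the design matrix `X`, the farthest-point starting scheme and the isotropic error variance are all assumptions made for this example.

```python
import numpy as np

def init_responsibilities(Y, g):
    """Hypothetical starting scheme: farthest-point centres, nearest-centre labels."""
    centres = [0]                                   # first profile is the first centre
    d = ((Y - Y[0]) ** 2).sum(axis=1)
    for _ in range(1, g):
        centres.append(int(d.argmax()))             # profile farthest from chosen centres
        d = np.minimum(d, ((Y - Y[centres[-1]]) ** 2).sum(axis=1))
    dist = ((Y[:, None, :] - Y[centres][None]) ** 2).sum(axis=2)   # (n, g)
    return np.eye(g)[dist.argmin(axis=1)]           # one-hot responsibilities

def fit_mixture_of_regressions(Y, X, g, n_iter=200, tol=1e-8):
    """EM for a g-component normal mixture with component means X @ beta_h.

    Y: (n, m) array, one row per gene profile over m conditions.
    X: (m, p) design matrix of covariates, assumed to have full column rank.
    Assumption: errors are independent with variance sigma2_h per component;
    random effects are deliberately left out of this sketch.
    """
    n, m = Y.shape
    tau = init_responsibilities(Y, g)               # (n, g)
    pinv = np.linalg.pinv(X)                        # (p, m), equals (X'X)^-1 X'
    history, prev = [], -np.inf
    for _ in range(n_iter):
        # M-step: every update is a closed-form weighted average or least squares.
        nh = tau.sum(axis=0)                        # effective component sizes
        pi = nh / n                                 # mixing proportions
        ybar = (tau.T @ Y) / nh[:, None]            # weighted mean profiles (g, m)
        beta = ybar @ pinv.T                        # regression coefficients (g, p)
        resid = Y[:, None, :] - (beta @ X.T)[None]  # (n, g, m)
        sq = (resid ** 2).sum(axis=2)               # squared residual norms (n, g)
        sigma2 = (tau * sq).sum(axis=0) / (m * nh)  # error variances (g,)
        # E-step: posterior probabilities of component membership, in log space.
        logp = np.log(pi) - 0.5 * m * np.log(2 * np.pi * sigma2) - 0.5 * sq / sigma2
        mx = logp.max(axis=1, keepdims=True)
        lse = mx + np.log(np.exp(logp - mx).sum(axis=1, keepdims=True))
        tau = np.exp(logp - lse)
        loglik = float(lse.sum())
        history.append(loglik)
        if abs(loglik - prev) < tol:                # stop when the log likelihood stalls
            break
        prev = loglik
    return pi, beta, sigma2, tau, history
```

Each gene can then be assigned to the component with the largest posterior probability, `tau.argmax(axis=1)`. In the model described above, the M-step would additionally update the variance components of the random effects, which this simplified example omits.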
