Gene Clustering via Integrated Markov Models Combining Individual and Pairwise Features

Clustering genes into groups that share common characteristics is a useful exploratory technique that supports a number of subsequent computational analyses. A wide range of clustering algorithms has been proposed, in particular for analyzing gene expression data, but most of them treat genes as independent entities or incorporate relevant information on gene interactions in a suboptimal way. We propose a probabilistic model that accounts simultaneously for individual data (e.g., expression) and pairwise data (e.g., interaction information coming from biological networks). Our model is based on hidden Markov random field models in which parametric probability distributions account for the distribution of individual data. Data on pairs, possibly reflecting distance or similarity measures between genes, are then included through a graph in which the nodes represent the genes and the edges are weighted according to the available interaction information. As a probabilistic model, it has many interesting theoretical features. In addition, preliminary experiments on simulated and real data show promising results and point to the gain from using such an approach. Availability: The software used in this work is written in C++ and is available with other supplementary material at http://mistis.inrialpes.fr/people/forbes/transparentia/supplementary.html.
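As a rough illustration of how individual and pairwise information can be combined in a hidden Markov random field, the sketch below performs one mean-field-style update of the class probabilities of each gene. It is a minimal sketch under stated assumptions, not the authors' C++ software: it assumes scalar expression values, Gaussian class distributions, and a Potts-like prior whose strength is set by a hypothetical parameter `beta`. All function and variable names are illustrative.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian (assumed form of the individual-data term)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mean_field_sweep(x, edges, q, means, stds, beta):
    """One sequential sweep updating class probabilities q[i][c] for every gene i.

    x     : list of scalar observations, one per gene
    edges : dict mapping (i, j) -> non-negative interaction weight
    q     : list of per-gene class probability lists (updated in place)
    beta  : assumed strength of the pairwise (Potts-like) term
    """
    n, k = len(x), len(means)
    # Build symmetric neighbour lists from the weighted graph.
    nbrs = [[] for _ in range(n)]
    for (i, j), w in edges.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    for i in range(n):
        scores = []
        for c in range(k):
            # Pairwise term: weighted support for class c among the neighbours.
            field = sum(w * q[j][c] for j, w in nbrs[i])
            # Individual term times pairwise term.
            scores.append(gaussian_pdf(x[i], means[c], stds[c]) * math.exp(beta * field))
        total = sum(scores)
        q[i] = [s / total for s in scores]
    return q

# Toy usage: gene 2 is ambiguous from its expression alone (x = 1.0 lies
# halfway between the two class means) but is linked to genes 0 and 1.
x = [0.1, 0.2, 1.0, 2.0, 2.1]
edges = {(0, 2): 1.0, (1, 2): 1.0}
q = [[0.5, 0.5] for _ in x]
q = mean_field_sweep(x, edges, q, means=[0.0, 2.0], stds=[1.0, 1.0], beta=1.0)
print(q[2])
```

With `beta = 0` the pairwise term vanishes and each gene is classified from its own data only, which corresponds to treating genes as independent entities. A positive `beta` lets the interaction weights pull linked genes toward the same cluster.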
