Genome-wide discovery of transcriptional modules from DNA sequence and gene expression

In this paper, we describe an approach for understanding transcriptional regulation from both gene expression and promoter sequence data. We aim to identify transcriptional modules--sets of genes that are co-regulated in a set of experiments, through a common motif profile. Using the EM algorithm, our approach refines both the module assignment and the motif profile so as to best explain the expression data as a function of transcriptional motifs. It also dynamically adds and deletes motifs, as required to provide a genome-wide explanation of the expression data. We evaluate the method on two Saccharomyces cerevisiae gene expression data sets, showing that our approach is better than a standard one at recovering known motifs and at generating biologically coherent modules. We also combine our results with binding localization data to obtain regulatory relationships with known transcription factors, and show that many of the inferred relationships have support in the literature.

[1]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[2]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems , 1988 .

[3]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[4]  Peter C. Cheeseman,et al.  Bayesian Classification (AutoClass): Theory and Results , 1996, Advances in Knowledge Discovery and Data Mining.

[5]  I. Jonassen,et al.  Predicting gene regulatory elements in silico on a genomic scale. , 1998, Genome research.

[6]  David Botstein,et al.  SGD: Saccharomyces Genome Database , 1998, Nucleic Acids Res..

[7]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[8]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[9]  G. Church,et al.  Systematic determination of genetic network architecture , 1999, Nature Genetics.

[10]  Saurabh Sinha,et al.  A Statistical Method for Finding Transcription Factor Binding Sites , 2000, ISMB.

[11]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[12]  D. Botstein,et al.  Genomic expression programs in the response of yeast cells to environmental changes. , 2000, Molecular biology of the cell.

[13]  G. Church,et al.  Identifying regulatory networks by combinatorial analysis of promoter elements , 2001, Nature Genetics.

[14]  H. Bussemaker,et al.  Regulatory element detection using correlation with expression , 2001, Nature Genetics.

[15]  Ben Taskar,et al.  Rich probabilistic models for gene expression , 2001, ISMB.

[16]  Nir Friedman,et al.  A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites , 2001, WABI.

[17]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[18]  Gary D Bader,et al.  Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry , 2002, Nature.

[19]  Nir Friedman,et al.  From promoter sequence to expression: a probabilistic framework , 2002, RECOMB '02.

[20]  Nicola J. Rinaldi,et al.  Transcriptional Regulatory Networks in Saccharomyces cerevisiae , 2002, Science.

[21]  P. Bork,et al.  Functional organization of the yeast proteome by systematic analysis of protein complexes , 2002, Nature.