Combining sequence and time series expression data to learn transcriptional modules

Our goal is to cluster genes into transcriptional modules, that is, sets of genes whose similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data, which are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present a generative probabilistic model for combining regulatory sequence and time series expression data to cluster genes into coherent transcriptional modules. Starting with a set of motifs representing known or putative regulatory elements (transcription factor binding sites) and the counts of occurrences of these motifs in each gene's promoter region, together with a time series expression profile for each gene, the learning algorithm uses expectation maximization to learn module assignments based on both types of data. We also present a technique, based on the Jensen-Shannon entropy contributions of motifs in the learned model, for associating the most significant motifs with each module. Thus, the algorithm gives a global approach for associating sets of regulatory elements with "modules" of genes with similar time series expression profiles. The model for expression data exploits our prior belief of smooth dependence on time by using statistical splines and is suitable for typical time course data sets with relatively few experiments. Moreover, the model is sufficiently interpretable that we can understand how both sequence data and expression data contribute to the cluster assignments, and how to interpolate between the two data sources. We present experimental results on the yeast cell cycle to validate our method and find that our combined expression and motif clustering algorithm discovers modules with both coherent expression and similar motif patterns, including binding motifs associated with known cell cycle transcription factors.
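
As an illustration of the motif-ranking idea, the sketch below splits a Jensen-Shannon divergence into per-motif contributions and ranks motifs by them. It is a minimal sketch under stated assumptions, not the procedure from the paper: the abstract does not say which distributions are compared or how they are weighted, so the comparison of a module's motif distribution against a background distribution, the equal 1/2 weights, the function name `js_contributions`, and the toy counts are all hypothetical.

```python
import numpy as np

def js_contributions(p, q):
    """Per-motif terms of the Jensen-Shannon divergence (in bits) between p and q.

    Assumption: equal weights of 1/2 on the two distributions. Inputs may be
    raw counts; they are normalized to probability vectors here. The returned
    terms are non-negative and sum to the full divergence.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl_terms(a):
        # Elementwise a * log2(a / m), using the convention 0 * log 0 = 0.
        out = np.zeros_like(a)
        nz = a > 0
        out[nz] = a[nz] * np.log2(a[nz] / m[nz])
        return out

    return 0.5 * kl_terms(p) + 0.5 * kl_terms(q)

# Toy example (made-up counts): three motifs, one enriched in the module.
module_counts = [30, 5, 5]
background_counts = [10, 10, 10]
contrib = js_contributions(module_counts, background_counts)
ranking = np.argsort(contrib)[::-1]  # motif indices, largest contribution first
```

Motifs at the front of `ranking` are the ones whose frequencies differ most between the two distributions, which is one plausible reading of "most significant motifs" for a module.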
