Discovering Motifs in Ranked Lists of DNA Sequences

Computational methods for discovering sequence elements that are enriched in a target set compared with a background set are fundamental in molecular biology research. One example is the discovery of transcription factor binding motifs inferred from ChIP–chip (chromatin immunoprecipitation on a microarray) measurements. Several major challenges in sequence motif discovery still require consideration: (i) the need for a principled approach to partitioning the data into target and background sets; (ii) the lack of rigorous models and of an exact p-value for measuring motif enrichment; (iii) the need for an appropriate framework for accounting for motif multiplicity; (iv) the tendency of many existing methods to report presumably significant motifs even when applied to randomly generated data. In this paper we present a statistical framework for discovering enriched sequence elements in ranked lists that resolves these four issues. We demonstrate the implementation of this framework in a software application, termed DRIM (discovery of rank imbalanced motifs), which identifies sequence motifs in lists of ranked DNA sequences. We applied DRIM to ChIP–chip and CpG methylation data and obtained the following results. (i) Identification of 50 novel putative transcription factor (TF) binding sites in yeast ChIP–chip data. The biological function of some of them was further investigated to gain new insights into transcription regulation networks in yeast. For example, our discoveries enable the elucidation of the network of the TF ARO80. Another finding concerns a systematic enhancement of TF binding to sequences containing CA repeats. (ii) Discovery of novel motifs in human cancer CpG methylation data. Remarkably, most of these motifs are similar to DNA sequence elements bound by the Polycomb complex, which promotes histone methylation. Our findings thus support a model in which histone methylation and CpG methylation are mechanistically linked.
Overall, we demonstrate that the statistical framework embodied in the DRIM software tool is highly effective for identifying regulatory sequence elements in a variety of applications ranging from expression and ChIP–chip to CpG methylation data. DRIM is publicly available at http://bioinfo.cs.technion.ac.il/drim.
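
The abstract does not spell out the enrichment statistic, so the following is only a minimal sketch of one way to score motif enrichment in a ranked list without fixing the target/background partition in advance. It assumes a hypergeometric tail probability evaluated at every possible cutoff, keeping the minimum; the function names and the 0/1 occurrence encoding are hypothetical and are not taken from DRIM. The minimum over cutoffs is not itself an exact p-value, since it would still need correction for the multiple cutoffs tested.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N sequences, B with motif, n drawn)."""
    total = comb(N, n)
    upper = min(n, B)
    favorable = sum(comb(B, k) * comb(N - B, n - k) for k in range(b, upper + 1))
    return favorable / total


def min_hypergeom_score(occurrence):
    """Scan all cutoffs of a ranked list and return (best score, best cutoff).

    occurrence: list of 0/1 flags in rank order (top-ranked sequence first),
    where 1 means the motif occurs in that sequence. This encoding is an
    assumption made for illustration; it ignores motif multiplicity.
    """
    N = len(occurrence)
    B = sum(occurrence)
    best_score, best_cutoff = 1.0, 0
    hits = 0
    for n in range(1, N):  # cutoff n: top n sequences form the target set
        hits += occurrence[n - 1]
        score = hypergeom_tail(N, B, n, hits)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff


# Toy ranked list: the motif occurs only in the three top-ranked sequences.
ranked_flags = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
score, cutoff = min_hypergeom_score(ranked_flags)
print(score, cutoff)
```

In this toy list the scan selects the cutoff after the third sequence, where all motif occurrences fall inside the target set. Letting the data choose the cutoff is the design point: the partition is a result of the analysis instead of an arbitrary input.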
