Fast, Sensitive Discovery of Conserved Genome-Wide Motifs

Regulatory sites that control gene expression are essential to the proper functioning of cells, and identifying them is critical for modeling regulatory networks. We have developed Magma (Multiple Aligner of Genomic Multiple Alignments), a software tool for multiple-species, multiple-gene motif discovery. Magma identifies putative regulatory sites that are conserved across multiple species and occur near multiple genes throughout a reference genome. It takes as input multiple alignments, which may include gaps, and uses efficient clustering methods that make it about 70 times faster than PhyloNet, a previous program for this task, with slightly greater sensitivity. We ran Magma on all non-coding DNA conserved between Caenorhabditis elegans and five additional species, about 70 Mbp in total, in under 4 h. We obtained 2,309 motifs with lengths of 6-20 bp, each occurring at least 10 times throughout the genome; collectively they covered about 566 kbp of the genomes, approximately 0.8% of the input. Predicted sites occurred in all types of non-coding sequence but were especially enriched in promoter regions. Comparisons with several experimental datasets show that Magma motifs correspond to a variety of known regulatory motifs.
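
The coverage figure quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the rounded values from the abstract (566 kbp covered, 70 Mbp of input), and the function name `coverage_fraction` is hypothetical, not part of Magma.

```python
def coverage_fraction(covered_bp: float, input_bp: float) -> float:
    """Return the fraction of the input sequence covered by predicted sites."""
    if input_bp <= 0:
        raise ValueError("input_bp must be positive")
    return covered_bp / input_bp


# Rounded values taken from the abstract (assumed exact for this check).
covered_bp = 566e3   # about 566 kbp covered by predicted sites
input_bp = 70e6      # about 70 Mbp of conserved non-coding input

frac = coverage_fraction(covered_bp, input_bp)
print(f"covered fraction of input: {frac:.2%}")
```

With these rounded inputs the ratio is about 0.81%, consistent with the "approximately 0.8%" stated in the abstract.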
