P-value-based regulatory motif discovery using positional weight matrices

Analyzing gene regulatory networks requires knowing the sequence-dependent DNA/RNA binding affinities of proteins and noncoding RNAs. Often, these are deduced from sets of sequences enriched in factor binding sites. Two classes of computational approaches exist. The first describes binding motifs by sequence patterns and searches for the patterns with the highest statistical significance of enrichment. The second class uses the more powerful position weight matrices (PWMs). Instead of maximizing the statistical significance of enrichment, these methods maximize a likelihood. Here we present XXmotif (eXhaustive evaluation of matriX motifs), the first PWM-based motif discovery method that can optimize PWMs by directly minimizing their P-values of enrichment. Optimization requires computing millions of enrichment P-values for thousands of PWMs. For a given PWM, the enrichment P-value is calculated efficiently from the match P-values of all possible motif placements in the input sequences using order statistics. The approach can naturally combine P-values for motif enrichment, conservation, and localization. On ChIP-chip/seq, miRNA knock-down, and coexpression data sets from yeast and metazoans, XXmotif outperformed state-of-the-art tools, both in the number of correctly identified motifs and in the quality of PWMs. In segmentation modules of D. melanogaster, we detect the known key regulators and several new motifs. In human core promoters, XXmotif reports most previously described motifs and eight novel motifs sharply peaked around the transcription start site, among them an Initiator motif similar to the fly and yeast versions. XXmotif's sensitivity, reliability, and usability will help to leverage the quickly accumulating wealth of functional genomics data.
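The order-statistics idea mentioned above can be illustrated with a minimal sketch. This is not XXmotif's implementation; it assumes match P-values that are independent and uniform on (0, 1) under the null model, and all function names are hypothetical. The first part asks how unlikely it is that the k-th smallest of n match P-values is as small as observed. The second part combines independent P-values (for example enrichment, conservation, and localization) through the standard formula for the distribution of their product.

```python
import math


def kth_order_pvalue(p_k: float, k: int, n: int) -> float:
    """Probability that the k-th smallest of n iid Uniform(0,1) values is <= p_k.

    This equals the upper binomial tail P(Binomial(n, p_k) >= k).
    """
    return sum(
        math.comb(n, i) * p_k**i * (1.0 - p_k) ** (n - i)
        for i in range(k, n + 1)
    )


def best_order_statistic(match_pvalues):
    """Return (k, score) for the rank k whose order-statistic P-value is smallest.

    Assumption: the minimum over k is reported uncorrected; a real method
    would still have to account for having selected the best k.
    """
    n = len(match_pvalues)
    ranked = sorted(match_pvalues)
    scores = [kth_order_pvalue(p, k, n) for k, p in enumerate(ranked, start=1)]
    best = min(range(n), key=scores.__getitem__)
    return best + 1, scores[best]


def combine_independent(pvalues):
    """P-value of the product of n independent Uniform(0,1) P-values.

    Uses P = p * sum_{i=0}^{n-1} (-ln p)^i / i!, where p is the product.
    All inputs must be strictly positive.
    """
    n = len(pvalues)
    log_prod = sum(math.log(p) for p in pvalues)
    x = -log_prod
    return math.exp(log_prod) * sum(x**i / math.factorial(i) for i in range(n))


if __name__ == "__main__":
    # Hypothetical match P-values: two strong sites, two background-like sites.
    k, score = best_order_statistic([0.9, 0.001, 0.002, 0.8])
    print(f"best k = {k}, order-statistic P-value = {score:.3g}")
    print(f"combined P-value = {combine_independent([0.01, 0.2, 0.05]):.3g}")
```

In the toy input, the two small match P-values together are more surprising than the single smallest one, so the minimum is reached at k = 2; this mirrors how an enrichment score can reward a motif that occurs in several sequences rather than once.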
