Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules

Background: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. Many algorithms exploit this clustering of binding sites to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, are found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed that compute p-values for simultaneous occurrences of different motifs which can overlap.

Results: We developed and implemented an algorithm that computes the p-value that s different motifs occur respectively k_1, ..., k_s or more times, possibly overlapping, in a random text. Motifs can be represented with most popular motif models, but in all cases without indels. Zero- or first-order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which an annotation of transcription factor binding sites exists. Our test allowed us to correctly identify transcription factors that bind to DNA cooperatively or competitively.

Method: The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs in O(n|Σ|(m|ℋ| + K|σ|^K) ∏_i k_i) time, where n is the length of the text, |Σ| is the alphabet size, m is the maximal motif length, |ℋ| is the total number of words in motifs, K is the order of the Markov model, and k_i is the number of occurrences of the i-th motif.

Conclusion: The primary objective of the program is to assess the likelihood that a given DNA segment is a CRM regulated by a known set of regulatory factors. In addition, the program can be used to select an appropriate threshold for PWM scanning. Another application is assessing the similarity of different motifs.

Availability: The project web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/AhoPro/
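As an illustration of the idea described under Method, the following is a minimal Python sketch, not the authors' implementation. It assumes a zero-order (Bernoulli) text model only, motifs given as explicit sets of words in which all words of one motif have the same length, and one counted occurrence per end position. The names build_automaton and cluster_pvalue are hypothetical. The sketch builds an Aho-Corasick style prefix tree with a complete transition function, then propagates probabilities over pairs (automaton state, occurrence counts capped at k_i).

```python
from collections import defaultdict, deque


def build_automaton(motifs, alphabet):
    """Prefix tree over all motif words plus a complete transition function.

    Returns (goto, hits): goto[state][letter] is the next state, and
    hits[state][i] is 1 if a word of motif i ends at this state.
    """
    children = [{}]
    ends = [[0] * len(motifs)]
    for i, words in enumerate(motifs):
        for word in words:
            state = 0
            for letter in word:
                if letter not in children[state]:
                    children.append({})
                    ends.append([0] * len(motifs))
                    children[state][letter] = len(children) - 1
                state = children[state][letter]
            ends[state][i] = 1

    size = len(children)
    fail = [0] * size
    goto = [{} for _ in range(size)]
    hits = [None] * size
    hits[0] = tuple(ends[0])
    queue = deque()
    for letter in alphabet:
        child = children[0].get(letter, 0)
        goto[0][letter] = child
        if child:
            queue.append(child)
    # Breadth-first order guarantees that the failure state is complete
    # before any state that falls back to it is processed.
    while queue:
        state = queue.popleft()
        hits[state] = tuple(max(a, b) for a, b in zip(ends[state], hits[fail[state]]))
        for letter in alphabet:
            if letter in children[state]:
                child = children[state][letter]
                fail[child] = goto[fail[state]][letter]
                goto[state][letter] = child
                queue.append(child)
            else:
                goto[state][letter] = goto[fail[state]][letter]
    return goto, hits


def cluster_pvalue(motifs, k, n, probs):
    """P(motif i occurs at least k[i] times for every i) in a random text of length n.

    probs maps each letter to its probability (zero-order model, an assumption
    of this sketch; the first-order case is not covered here).
    """
    alphabet = sorted(probs)
    goto, hits = build_automaton(motifs, alphabet)
    target = tuple(k)
    dist = {(0, (0,) * len(k)): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (state, counts), p in dist.items():
            for letter in alphabet:
                t = goto[state][letter]
                # Cap each counter at k[i]: only "k[i] or more" matters.
                new = tuple(min(ki, c + h) for ki, c, h in zip(k, counts, hits[t]))
                nxt[(t, new)] += p * probs[letter]
        dist = nxt
    return sum(p for (_, counts), p in dist.items() if counts == target)


# Over the alphabet {A, C}, only the text "AAC" of length 3 contains both AA and AC.
print(cluster_pvalue([{"AA"}, {"AC"}], (1, 1), 3, {"A": 0.5, "C": 0.5}))  # 0.125
```

Capping the counters at k_i keeps the number of tracked count vectors bounded by ∏_i (k_i + 1), which is in the spirit of the ∏_i k_i factor in the stated running time, although this sketch makes no attempt to match that bound exactly.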
