An efficient algorithm for the identification of structured motifs in DNA promoter sequences

We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called box-link, to store information about conserved regions that occur in a well-ordered and regularly spaced manner in the data set sequences. Conserved regions of this type, called structured motifs, are extremely relevant to research on gene regulatory mechanisms, since they can effectively represent promoter models. The complexity analysis shows a time and space gain over the best known exact algorithms that is exponential in the spacings between binding sites. A full implementation of the algorithm was developed and made available online. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than four orders of magnitude. The application of the method to biological data sets shows its ability to extract relevant consensi.
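To make the notion of a structured motif concrete, the following minimal sketch tests by brute force whether an ordered list of boxes, separated by bounded spacings, occurs in a sequence. It is not RISO and does not use the box-link structure; it only illustrates the kind of object being searched for. The function names (`hamming`, `occurs`, `quorum`), the use of Hamming distance as the error model, and the example sequences are all assumptions made for illustration.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(seq: str, boxes: List[str], gaps: List[Tuple[int, int]],
           max_err: int) -> bool:
    """Return True if the boxes occur in order in seq.

    Each box may differ from the sequence in at most max_err positions
    (assumed error model). gaps[i] = (dmin, dmax) bounds the number of
    bases between the end of box i and the start of box i + 1.
    """
    def match_from(i: int, pos: int) -> bool:
        box = boxes[i]
        end = pos + len(box)
        # The box must fit inside the sequence and match within max_err.
        if end > len(seq) or hamming(seq[pos:end], box) > max_err:
            return False
        if i == len(boxes) - 1:
            return True
        dmin, dmax = gaps[i]
        # Try every allowed spacing before the next box.
        return any(match_from(i + 1, end + d) for d in range(dmin, dmax + 1))

    return any(match_from(0, p) for p in range(len(seq)))


def quorum(seqs: List[str], boxes: List[str], gaps: List[Tuple[int, int]],
           max_err: int) -> int:
    """Number of sequences that contain at least one occurrence."""
    return sum(occurs(s, boxes, gaps, max_err) for s in seqs)


# Hypothetical two-box model with a spacing of 4 to 6 bases.
boxes = ["TTGACA", "TATAAT"]
gaps = [(4, 6)]
seqs = [
    "CC" + "TTGACA" + "A" * 5 + "TATAAT" + "CC",  # exact boxes, spacing 5
    "TTGACA" + "A" * 2 + "TATAAT",                # spacing 2, out of range
    "GG" + "TTGTCA" + "A" * 6 + "TATAAT" + "G",   # one mismatch, spacing 6
]
print(quorum(seqs, boxes, gaps, max_err=1))
```

The inner loop over spacings hints at why the spacing ranges matter for cost: a naive search revisits every allowed distance between boxes, which is the dimension in which the abstract reports its gain.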
