Suffix Tree Characterization of Maximal Motifs in Biological Sequences

Finding motifs in biological sequences is one of the most intriguing problems for designers of string algorithms, because motifs must be matched approximately, which complicates the problem. Existing algorithms run in time linear in the input size. However, because of the approximation, the output can be very large, which often makes it unreadable and also slows down the inference itself. Since a subset of the motifs, namely the maximal motifs, can be enough to convey the information carried by all of them, this paper aims to remove such redundancy. We define notions of maximality and characterize them in the suffix tree data structure. Because the suffix tree is used by a whole class of motif extraction tools, we show how these tools can be modified to enforce the maximality requirement on the fly without changing their asymptotic complexity.
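To make the redundancy argument concrete, the following is a minimal sketch for the much simpler exact-match setting. It is not the paper's definition of maximality, which covers approximate motifs, and it uses brute-force enumeration rather than a suffix tree. The function names `repeated_substrings` and `maximal_repeats`, and the rule "a repeat is redundant if a one-character extension has the same number of occurrences", are assumptions chosen for illustration.

```python
from collections import defaultdict


def repeated_substrings(text, min_occ=2):
    """Map every substring occurring at least min_occ times to its start positions.

    Brute force (quadratically many substrings); for illustration only.
    """
    occ = defaultdict(list)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[text[i:j]].append(i)
    return {s: pos for s, pos in occ.items() if len(pos) >= min_occ}


def maximal_repeats(text, min_occ=2):
    """Keep only repeats that cannot be extended without losing occurrences.

    Assumption (illustrative, exact matching only): a repeat s is redundant
    if some one-character extension c+s or s+c occurs exactly as often as s,
    since the longer string then accounts for every occurrence of s.
    """
    reps = repeated_substrings(text, min_occ)
    alphabet = set(text)
    result = {}
    for s, pos in reps.items():
        count = len(pos)
        redundant = any(
            len(reps.get(c + s, ())) == count or len(reps.get(s + c, ())) == count
            for c in alphabet
        )
        if not redundant:
            result[s] = pos
    return result


if __name__ == "__main__":
    text = "abcabcab"
    print(len(repeated_substrings(text)), "repeats in total")
    print(maximal_repeats(text))
```

On the string `abcabcab`, twelve substrings occur at least twice, but only `ab` and `abcab` survive the filter; every other repeat always appears inside a longer repeat with the same number of occurrences. In the exact setting, a suffix tree supports this kind of check without enumerating all substrings, because repeats that cannot be extended to the right correspond to internal nodes of the tree.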
