Spelling Approximate Repeated or Common Motifs Using a Suffix Tree

We present two algorithms in this paper. The first extracts repeated motifs from a sequence defined over an alphabet Σ. For instance, Σ may be {A, C, G, T} and the sequence may encode a DNA macromolecule. The motifs sought correspond to words over the same alphabet that occur at least q times in the sequence, with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N ≥ 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in at least q distinct sequences of the set, where 1 ≤ q ≤ N. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore treat the motifs, whether repeated in a sequence or common to a set of sequences, as external objects and call them valid models if they satisfy the quorum constraint q. The approach we introduce here for finding all valid models, corresponding to either repeated or common motifs, starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply spell the models. Assuming an alphabet of fixed size, the total time needed is O(nN^2 V(e, k)) using O(nN^2/ω) space, where n is the (average) length of the sequence(s), k is the length of the models sought or the length of the longest possible valid models, ω is the size of a machine word, and V(e, k) is the number of words of length k at a Hamming distance of at most e from another word of length k. V(e, k) is bounded above by k^e |Σ|^e. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the common motifs problem whenever N < k|Σ|, and a better space bound when N/ω < k. For the repeated motifs problem, it is a better bound in both time and space in all cases. The complexities obtained in this second case are O(nV(e, k)) and O(n), respectively.
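As an aid to reading the bound on V(e, k), the quantity can be written out with the standard count of a Hamming ball. This is a worked sketch based on that textbook counting argument, not necessarily the derivation used in the paper: choose the i mismatching positions among k, then one of the |Σ| − 1 other letters at each of them.

```latex
\[
V(e,k) \;=\; \sum_{i=0}^{e} \binom{k}{i}\,\bigl(|\Sigma|-1\bigr)^{i}
\;\le\; \sum_{i=0}^{e} \bigl(k(|\Sigma|-1)\bigr)^{i}
\;\le\; \bigl(k(|\Sigma|-1)+1\bigr)^{e}
\;\le\; k^{e}\,|\Sigma|^{e}
\qquad (k \ge 1).
\]
% Example for DNA (|\Sigma| = 4) with k = 4 and e = 1:
% V(1,4) = 1 + 4 \cdot 3 = 13 \le 4^{1} \cdot 4^{1} = 16.
```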
Finally, we suggest how to extend these algorithms to deal with gaps.
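The notions of model, mismatch bound e, and quorum q can be illustrated with a small Python sketch. It is not the suffix-tree algorithm of the paper: it assumes a fixed model length k, handles mismatches only (no gaps), and replaces the suffix tree with an explicit list of occurrence positions, so it does not achieve the stated complexities. The function name `spell_models` and its parameters are hypothetical, chosen for illustration. Models are spelled one letter at a time, and a prefix is abandoned as soon as it fails the quorum, since every extension of it would fail too.

```python
from typing import Dict, List, Tuple


def spell_models(seqs: List[str], k: int, e: int, q: int,
                 alphabet: str = "ACGT",
                 count_sequences: bool = True) -> Dict[str, List[Tuple[int, int]]]:
    """Return every valid model of length k with its occurrences.

    An occurrence is a pair (sequence index, start position) whose k-length
    window differs from the model in at most e positions.
    count_sequences=True  -> quorum counts distinct sequences (common motifs).
    count_sequences=False -> quorum counts occurrences (repeated motifs).
    """
    # Initially the empty model occurs everywhere with 0 mismatches.
    start = [(s, p, 0)
             for s, seq in enumerate(seqs)
             for p in range(len(seq) - k + 1)]
    valid: Dict[str, List[Tuple[int, int]]] = {}

    def quorum(occs: List[Tuple[int, int, int]]) -> int:
        if count_sequences:
            return len({s for s, _, _ in occs})
        return len(occs)

    def extend(model: str, occs: List[Tuple[int, int, int]]) -> None:
        if len(model) == k:
            valid[model] = [(s, p) for s, p, _ in occs]
            return
        depth = len(model)
        for letter in alphabet:
            # Keep only occurrences that still fit within e mismatches.
            kept = []
            for s, p, mis in occs:
                m = mis + (seqs[s][p + depth] != letter)
                if m <= e:
                    kept.append((s, p, m))
            # Prune: a prefix below the quorum cannot lead to a valid model.
            if quorum(kept) >= q:
                extend(model + letter, kept)

    extend("", start)
    return valid


# The model AAA occurs with one mismatch in each sequence but never exactly.
common = spell_models(["AAT", "ATA", "TAA"], k=3, e=1, q=3)
print("AAA" in common)

# Repeated motifs in one sequence: quorum counts occurrences.
repeated = spell_models(["ACGTACGA"], k=3, e=0, q=2, count_sequences=False)
print(repeated)
```

The first call shows a valid model that is an external object in the sense of the abstract: AAA satisfies the quorum in all three sequences without appearing in any of them.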

[1] Udi Manber et al. Fast text searching: allowing errors, 1992, CACM.

[2] Alain Viari et al. A Double Combinatorial Approach to Discovering Patterns in Biological Sequences, 1996, CPM.

[3] Dan Gusfield et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[4] Gaston H. Gonnet et al. A new approach to text searching, 1992, CACM.

[5] Esko Ukkonen et al. Approximate String-Matching over Suffix Trees, 1993, CPM.

[6] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[7] 김동규 et al. [Book review] "Algorithms on Strings, Trees, and Sequences", 2000.

[8] Esko Ukkonen et al. Constructing Suffix Trees On-Line in Linear Time, 1992, IFIP Congress.

[9] M. S. Waterman et al. Multiple sequence alignment by consensus, 1986, Nucleic Acids Research.

[10] Maxime Crochemore et al. An Optimal Algorithm for Computing the Repetitions in a Word, 1981, Inf. Process. Lett.

[11] Archie L. Cobbs. Fast Identification of Approximately Matching Substrings, 1994, CPM.

[12] A. A. Reilly et al. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, 1990, Proteins.

[13] David Haussler et al. Sequence landscapes, 1986, Nucleic Acids Res.

[14] Alain Viari et al. Searching for Repeated Words in a Text Allowing for Mismatches and Gaps, 1995.

[15] Wojciech Rytter et al. Text Algorithms, 1994.

[16] Eugene W. Myers et al. Identifying satellites in nucleic acid sequences, 1998, RECOMB '98.

[17] M. Waterman et al. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli, 1985, Journal of Molecular Biology.

[18] Lucas Chi Kwong Hui et al. Color Set Size Problem with Application to String Matching, 1992, CPM.

[19] Edward M. McCreight et al. A Space-Economical Suffix Tree Construction Algorithm, 1976, JACM.

[20] M. Waterman et al. Pattern recognition in several sequences: consensus and alignment, 1984, Bulletin of Mathematical Biology.

[21] C. Lefèvre et al. A fast word search algorithm for the representation of sequence similarity in genomic DNA, 1994, Nucleic Acids Research.

[22] John Riedl et al. Generalized suffix trees for biological sequence data: applications and implementation, 1994, Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.