Paper information - Spelling Approximate Repeated or Common Motifs Using a Suffix Tree

We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet Σ. For instance, Σ may be equal to {A, C, G, T} and the sequence may represent an encoding of a DNA macromolecule. The motifs sought are words over the same alphabet that occur at least q times in the sequence, each time with at most e mismatches (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N ≥ 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in at least q distinct sequences of the set, where 1 ≤ q ≤ N. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being external objects, and we call them valid models if they satisfy the quorum constraint q. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply spell the models. Assuming an alphabet of fixed size, the total time needed is O(n N^2 V(e, k)) using O(n N^2 / ω) space, where n is the (average) length of the sequence(s), k is the length of the models sought or the length of the longest possible valid models, ω is the size of a machine word, and V(e, k) is the number of words of length k at a Hamming distance of at most e from another word of length k. V(e, k) is bounded above by k^e |Σ|^e. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the common motifs problem whenever N < k|Σ|, and a better space bound when N/ω < k. For the repeated motifs problem, it is a better bound in absolute terms for both time and space: the complexities obtained in that case are O(n V(e, k)) and O(n), respectively.
Finally, we suggest how to extend these algorithms to deal with gaps.
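
To make the definitions in the abstract concrete, the following Python sketch illustrates the repeated motifs problem on a single sequence. It is not the suffix-tree algorithm of the paper: it is a brute-force illustration that spells models one letter at a time directly against the sequence. The function names `neighbourhood_size` and `spell_repeated_models` are hypothetical, the closed form used for V(e, k) is the standard count of words within a Hamming ball, and counting occurrences by distinct start position (overlaps allowed) is an assumption.

```python
from math import comb


def neighbourhood_size(e, k, sigma):
    """V(e, k): number of length-k words at Hamming distance <= e from a fixed word.

    Choose i mismatch positions among k, then one of (sigma - 1) other letters
    at each of them, for i = 0..e.
    """
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(e + 1))


def spell_repeated_models(seq, k, e, q, alphabet="ACGT"):
    """Return {model: number of occurrences} for every valid model of length k.

    Brute-force illustration only (no suffix tree). Each partial model keeps
    the start positions where it still matches with at most e mismatches.
    Assumption: occurrences are counted by distinct start position.
    """
    # start position -> mismatches accumulated so far
    starts = {i: 0 for i in range(len(seq) - k + 1)}
    valid = {}

    def extend(model, occ):
        if len(model) == k:
            valid[model] = len(occ)
            return
        depth = len(model)
        for letter in alphabet:
            nxt = {}
            for start, mism in occ.items():
                m = mism + (seq[start + depth] != letter)
                if m <= e:
                    nxt[start] = m
            # Occurrences of a longer model are a subset of those of its
            # prefix, so a prefix below the quorum can be pruned safely.
            if len(nxt) >= q:
                extend(model + letter, nxt)

    extend("", starts)
    return valid


if __name__ == "__main__":
    print(neighbourhood_size(1, 4, 4))
    print(spell_repeated_models("ACGTACGA", k=4, e=1, q=2))
```

On the sequence ACGTACGA with k = 4, e = 1 and q = 2, the valid models are ACGA, ACGC, ACGG and ACGT. ACGC and ACGG never occur exactly in the sequence, which is the sense in which models are external objects.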

Marie-France Sagot | M. Sagot

[1] Udi Manber, et al. Fast text searching: allowing errors, 1992, CACM.

[2] Alain Viari, et al. A Double Combinatorial Approach to Discovering Patterns in Biological Sequences, 1996, CPM.

[3] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[4] Gaston H. Gonnet, et al. A new approach to text searching, 1992, CACM.

[5] Esko Ukkonen, et al. Approximate String-Matching over Suffix Trees, 1993, CPM.

[6] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[7] 김동규, et al. [Book review] "Algorithms on Strings, Trees, and Sequences", 2000.

[8] Esko Ukkonen, et al. Constructing Suffix Trees On-Line in Linear Time, 1992, IFIP Congress.

[9] M. S. Waterman, et al. Multiple sequence alignment by consensus, 1986, Nucleic Acids Research.

[10] Maxime Crochemore, et al. An Optimal Algorithm for Computing the Repetitions in a Word, 1981, Inf. Process. Lett.

[11] Archie L. Cobbs. Fast Identification of Approximately Matching Substrings, 1994, CPM.