Paper information - An efficient algorithm for finding short approximate non-tandem repeats

We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., certain maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected number. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(D N^(3 pow(epsilon) - 1) log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences.
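
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the seed-based algorithm of the paper; it is only a naive baseline that compares every pair of windows directly. The function names `edit_distance` and `naive_approximate_repeats` are hypothetical, and two simplifying assumptions are made: both occurrences are taken as windows of exactly length P, and "non-tandem" is approximated by requiring only that the two windows do not overlap.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and mismatches each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (0) or mismatch (1)
            ))
        prev = curr
    return prev[-1]


def naive_approximate_repeats(s: str, p: int, d: int) -> List[Tuple[int, int]]:
    """Return start positions (i, j) of non-overlapping length-p windows
    of s whose edit distance is at most d (simplified problem, see lead-in)."""
    pairs = []
    last_start = len(s) - p
    for i in range(last_start + 1):
        # j >= i + p keeps the two windows from overlapping (assumption).
        for j in range(i + p, last_start + 1):
            if edit_distance(s[i:i + p], s[j:j + p]) <= d:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # "ACGTA" at position 0 and "ACCTA" at position 10 differ by one mismatch.
    print(naive_approximate_repeats("ACGTAGGGGGACCTA", p=5, d=1))  # [(0, 10)]
```

This baseline examines a number of window pairs that is quadratic in N, which is the cost a sub-quadratic method such as the one described in the abstract is meant to avoid.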

Tao Jiang | Michael Kaufmann | Ezekiel F. Adebiyi
