A graph approach to the threshold all-against-all substring matching problem

We present a novel graph model and an efficient algorithm for solving the “threshold all against all” problem, which involves searching two strings (of lengths <i>M</i> and <i>N</i>, respectively) for all maximal approximate substring matches of length at least <i>S</i>, with up to <i>K</i> differences. Our algorithm solves the problem in time <i>O</i>(<i>MNK</i><sup>3</sup>), which is a considerable improvement over the previously known bound for this problem. We also provide experimental evidence that, in practice, our algorithm runs faster than its worst-case bound suggests.
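To make the problem statement concrete, the following is a minimal brute-force sketch of the search it describes. It is not the graph-based algorithm of this paper and comes nowhere near its running time. It assumes that "differences" means unit-cost edit operations (insertions, deletions, substitutions) and that both matched substrings must have length at least <i>S</i>. It also reports every qualifying pair rather than only the maximal ones. The function names `edit_distance` and `threshold_matches` are hypothetical and chosen for illustration.

```python
def edit_distance(x, y):
    """Unit-cost edit distance between strings x and y (row-by-row DP)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # match or substitute
        prev = cur
    return prev[-1]


def threshold_matches(a, b, s, k):
    """Return all (i, i_end, j, j_end) such that a[i:i_end] and b[j:j_end]
    both have length >= s and are within edit distance k of each other.

    Assumption: no maximality filter is applied, so nested matches are
    all reported. This is exhaustive enumeration, for illustration only.
    """
    hits = []
    for i in range(len(a)):
        for i_end in range(i + s, len(a) + 1):
            for j in range(len(b)):
                for j_end in range(j + s, len(b) + 1):
                    # Lengths differing by more than k cannot match.
                    if abs((i_end - i) - (j_end - j)) > k:
                        continue
                    if edit_distance(a[i:i_end], b[j:j_end]) <= k:
                        hits.append((i, i_end, j, j_end))
    return hits
```

For example, `threshold_matches("abc", "abd", 3, 1)` reports the single pair covering both strings, since they differ by one substitution. The nested enumeration over all substring pairs is what makes this approach impractical beyond very short inputs.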
