On the Common Substring Alignment Problem

The Common Substring Alignment Problem is defined as follows: given a set of one or more source strings S1, S2, ..., Sc and a target string T, where Y is a common substring of all the strings Si, that is, Si = Bi Y Fi, compute the similarity of every string Si with T without recomputing the part of the comparison that involves Y. With classical dynamic programming, each appearance of Y in a source string requires computing all the values in a dynamic programming table of size O(nℓ), where n is the size of T and ℓ is the size of Y. Here we describe an algorithm composed of an encoding stage and an alignment stage. During the encoding stage, a data structure is constructed that encodes the comparison of Y with T. Then, during the alignment stage, each comparison of a source string Si with T uses this precomputed data structure to speed up the part that involves Y. We show how to reduce the O(nℓ) alignment work for each appearance of the common substring Y in a source string to O(n), at the cost of O(nℓ) encoding work, which is executed only once.
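To make the cost being targeted concrete, the following is a minimal sketch of the classical baseline only, not of the encoding-based algorithm described above. It assumes unit-cost edit distance as the similarity measure (the abstract does not fix one), and the names advance_rows and edit_distance_by_blocks are hypothetical. The table is filled block by block for Bi, Y, and Fi; the call that processes Y does O(nℓ) work, with n = |T| and ℓ = |Y|, and is repeated for every source string.

```python
def advance_rows(row, segment, target):
    """Push one DP row through the block of rows belonging to `segment`.

    `row[j]` holds the edit distance between the source prefix consumed so
    far and target[:j]. Cost: O(len(segment) * len(target)).
    """
    for ch in segment:
        new_row = [row[0] + 1]  # first column: delete every source character
        for j, t in enumerate(target, start=1):
            new_row.append(min(
                row[j] + 1,             # delete ch from the source
                new_row[j - 1] + 1,     # insert t from the target
                row[j - 1] + (ch != t)  # match (cost 0) or substitute (cost 1)
            ))
        row = new_row
    return row


def edit_distance_by_blocks(prefix, common, suffix, target):
    """Edit distance between prefix + common + suffix and target.

    Assumption: unit-cost edit distance stands in for the similarity score.
    """
    row = list(range(len(target) + 1))        # row for the empty source prefix
    row = advance_rows(row, prefix, target)   # block for Bi
    row = advance_rows(row, common, target)   # block for Y: O(n * l) on every call
    row = advance_rows(row, suffix, target)   # block for Fi
    return row[-1]


if __name__ == "__main__":
    # "kitten" split as B = "k", Y = "itt", F = "en", compared with "sitting".
    print(edit_distance_by_blocks("k", "itt", "en", "sitting"))
```

In this sketch the block for Y is recomputed from scratch for each source string, even though Y and T never change. The encoding stage described above replaces that repeated step with a lookup in a structure built once from Y and T.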
