Sparse LCS Common Substring Alignment

The "Common Substring Alignment" problem is defined as follows. The input consists of a set of source strings S1, S2, ..., Sc, which share a common substring that appears at least once in each of them, and a target string T. The goal is to compute the similarity of every string Si with T without recomputing the part of the alignment that involves the common substring each time it appears. In this paper we consider the Common Substring Alignment problem for the LCS (Longest Common Subsequence) similarity metric. Our algorithm gains its efficiency by exploiting the sparsity inherent to the LCS problem. Let Y be the common substring, n be the size of the compared sequences, Ly be the length of the LCS of T and Y, denoted |LCS[T, Y]|, and L be max{|LCS[T, Si]|} over all source strings Si. Our algorithm consists of an O(nLy) time encoding stage that is executed once per common substring, and an O(L) time alignment stage that is executed once for each appearance of the common substring in each source string. The additional running time depends only on the length of the parts of the strings that are not in any common substring.
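To illustrate what "sparsity" means for LCS, the following is a minimal Python sketch of a classical Hunt-Szymanski-style computation, which visits only the match points (pairs i, j with T[i] = S[j]) instead of filling the full dynamic programming table. It is not the encoding or alignment stage described above; the function name `lcs_length_sparse` and its parameters are assumptions made for this example.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_sparse(t: str, s: str) -> int:
    """Return |LCS[t, s]| by processing only the match points.

    Illustrative sketch of the classical match-point technique,
    not the algorithm of the paper.
    """
    # positions[ch] lists the indices j at which ch occurs in s, in increasing order.
    positions = defaultdict(list)
    for j, ch in enumerate(s):
        positions[ch].append(j)

    # tails[k] is the smallest index in s at which a common subsequence
    # of length k + 1 can end, given the prefix of t processed so far.
    tails = []
    for ch in t:
        # Visit this row's match points in decreasing j, so that two
        # matches from the same position of t cannot extend each other.
        for j in reversed(positions.get(ch, ())):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)   # a longer common subsequence was found
            else:
                tails[k] = j      # same length, but it ends earlier in s
    return len(tails)


if __name__ == "__main__":
    print(lcs_length_sparse("ABCBDAB", "BDCABA"))  # prints 4
```

The work done is proportional to the number of match points (times a logarithmic factor for the binary search), so the sketch is fast when matches are rare and degrades toward the dense table when the strings share many characters.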
