A sub-quadratic sequence alignment algorithm for unrestricted cost matrices

The classical algorithm for computing the similarity between two sequences [36, 39] uses a dynamic programming matrix, and compares two strings of size <i>n</i> in <i>O</i>(<i>n</i><sup>2</sup>) time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both <i>local</i> and <i>global</i> alignment computations.The speed-up is achieved by dividing the dynamic programming matrix into variable sized blocks, as induced by Lempel-Ziv parsing of both strings, and utilizing the inherent periodic nature of both strings. This leads to an <i>O</i>(<i>n</i><sup>2</sup>/log <i>n</i>) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually <i>O</i>(<i>hn</i><sup>2</sup>/log <i>n</i>) where <i>h</i> ≤ 1 is the entropy of the text.

[1]  Daniel J. Kleitman,et al.  An Almost Linear Time Algorithm for Generalized Matrix Searching , 1990, SIAM J. Discret. Math..

[2]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[3]  Ayumi Shinohara,et al.  Speeding Up Pattern Matching by Text Compression , 2000, CIAC.

[4]  Ayumi Shinohara,et al.  Shift-And Approach to Pattern Matching in LZW Compressed Text , 1999, CPM.

[5]  Gad M. Landau,et al.  On the shared substring alignment problem , 2000, SODA '00.

[6]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[7]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[8]  Zvi Galil,et al.  A Linear-Time Algorithm for Concave One-Dimensional Dynamic Programming , 1990, Inf. Process. Lett..

[9]  Gonzalo Navarro,et al.  Boyer-Moore String Matching over Ziv-Lempel Compressed Text , 2000, CPM.

[10]  E. Myers,et al.  Sequence comparison with concave weighting functions. , 1988, Bulletin of mathematical biology.

[11]  Wojciech Rytter,et al.  Text Algorithms , 1994 .

[12]  Gary Benson,et al.  Let sleeping files lie: pattern matching in Z-compressed files , 1994, SODA '94.

[13]  Terry A. Welch,et al.  A Technique for High-Performance Data Compression , 1984, Computer.

[14]  David Eppstein,et al.  Sparse dynamic programming II: convex and concave cost functions , 1992, JACM.

[15]  Raffaele Giancarlo Dynamic programming: special cases , 1997, Pattern Matching Algorithms.

[16]  Raffaele Giancarlo,et al.  Speeding up Dynamic Programming with Applications to Molecular Biology , 1989, Theor. Comput. Sci..

[17]  Gary Benson A Space Efficient Algorithm for Finding the Best Nonoverlapping Alignment Score , 1995, Theor. Comput. Sci..

[18]  Wojciech Rytter,et al.  Almost-optimal fully LZW-compressed pattern matching , 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[19]  János Csirik,et al.  An Improved Algorithm for Computing the Edit Distance of Run-Length Coded Strings , 1995, Inf. Process. Lett..

[20]  Gad M. Landau,et al.  Edit distance of run-length encoded strings , 2002, Inf. Process. Lett..

[21]  Daniel S. Hirschberg,et al.  A linear space algorithm for computing maximal common subsequences , 1975, Commun. ACM.

[22]  Alok Aggarwal,et al.  Notes on searching in multidimensional monotone arrays , 1988, [Proceedings 1988] 29th Annual Symposium on Foundations of Computer Science.

[23]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[24]  Lawrence L. Larmore,et al.  The least weight subsequence problem , 1987, 26th Annual Symposium on Foundations of Computer Science (sfcs 1985).

[25]  Peter Elias,et al.  Universal codeword sets and representations of the integers , 1975, IEEE Trans. Inf. Theory.

[26]  Michael G. Main,et al.  An O(n log n) Algorithm for Finding All Repetitions in a String , 1984, J. Algorithms.

[27]  M. Waterman,et al.  A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. , 1987, Journal of molecular biology.

[28]  Kun-Mao Chao,et al.  Recent Developments in Linear-Space Alignment Methods: A Survey , 1994, J. Comput. Biol..

[29]  Abraham Lempel,et al.  On the Complexity of Finite Sequences , 1976, IEEE Trans. Inf. Theory.

[30]  Sampath Kannan,et al.  An Algorithm for Locating Non-Overlapping Regions of Maximum Alignment Score , 1993, CPM.

[31]  Setsuo Arikawa,et al.  Faster approximate string matching over compressed text , 2001, Proceedings DCC 2001. Data Compression Conference.

[32]  Philippe Jacquet,et al.  Asymptotic Behavior of the Lempel-Ziv Parsing Scheme and Digital Search Trees , 1995, Theor. Comput. Sci..

[33]  Ayumi Shinohara,et al.  Speeding Up String Pattern Matching by Text Compression: The Dawn of a New Era , 2001 .

[34]  David Eppstein,et al.  Sequence Comparison with Mixed Convex and Concave Costs , 1990, J. Algorithms.

[35]  Mikhail J. Atallah,et al.  Efficient Parallel Algorithms for String Editing and Related Problems , 1990, SIAM J. Comput..

[36]  David Eppstein,et al.  Sparse dynamic programming I: linear cost functions , 1992, JACM.

[37]  Gad M. Landau,et al.  On the Common Substring Alignment Problem , 2001, J. Algorithms.

[38]  Mikkel Thorup,et al.  String Matching in Lempel—Ziv Compressed Strings , 1998, Algorithmica.

[39]  Gonzalo Navarro,et al.  Approximate Matching of Run-Length Compressed Strings , 2002, Algorithmica.

[40]  Gad M. Landau,et al.  Incremental String Comparison , 1998, SIAM J. Comput..

[41]  Maxime Crochemore,et al.  Transducers and Repetitions , 1986, Theor. Comput. Sci..

[42]  Jeanette P. Schmidt,et al.  All Highest Scoring Paths in Weighted Grid Graphs and Their Application to Finding All Approximate Repeats in Strings , 1998, SIAM J. Comput..

[43]  Shane S. Sturrock,et al.  Time Warps, String Edits, and Macromolecules – The Theory and Practice of Sequence Comparison . David Sankoff and Joseph Kruskal. ISBN 1-57586-217-4. Price £13.95 (US$22·95). , 2000 .

[44]  Sampath Kannan,et al.  An Algorithm for Locating Nonoverlapping Regions of Maximum Alignment Score , 1996, SIAM J. Comput..

[45]  Esko Ukkonen,et al.  Lempel-Ziv parsing and sublinear-size index structures for string matching , 1996 .

[46]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[47]  Mike Paterson,et al.  A Faster Algorithm Computing String Edit Distances , 1980, J. Comput. Syst. Sci..

[48]  Udi Manber A text compression scheme that allows fast searching directly in the compressed file , 1997, TOIS.

[49]  Gonzalo Navarro,et al.  Approximate String Matching over Ziv-Lempel Compressed Text , 2000, CPM.

[50]  David Eppstein,et al.  Speeding up dynamic programming , 1988, [Proceedings 1988] 29th Annual Symposium on Foundations of Computer Science.

[51]  Alok Aggarwal,et al.  Geometric applications of a matrix-searching algorithm , 1987, SCG '86.

[52]  Gad M. Landau,et al.  Matching for Run-Length Encoded Strings , 1999, J. Complex..

[53]  Gonzalo Navarro,et al.  A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text , 1999, CPM.