Semi-local String Comparison: Algorithmic Techniques and Applications

Abstract. Given two strings, the longest common subsequence (LCS) problem asks for the length of the longest string that is a subsequence of both input strings. Its generalisation, the all semi-local LCS problem, asks for the LCS length of each string against all substrings of the other string, and of all prefixes of each string against all suffixes of the other string. We survey a number of algorithmic techniques related to the all semi-local LCS problem. We then present a number of algorithmic applications of these techniques, both existing and new. In particular, we obtain a new all semi-local LCS algorithm whose asymptotic running time matches, in the case of an unbounded alphabet, that of the fastest known global LCS algorithm, due to Masek and Paterson. We conclude that semi-local string comparison is a useful algorithmic plug-in, which unifies, and often improves on, a number of previous approaches to various substring- and subsequence-related problems.
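
To make the two problem statements concrete, the following Python sketch computes the global LCS length with the textbook quadratic dynamic programme and then tabulates every semi-local LCS length by brute force. It illustrates the definitions only and is not the algorithm obtained in the paper; the function names `lcs_length` and `semi_local_lcs_naive`, the dictionary keys, and the half-open index convention are assumptions introduced here.

```python
def lcs_length(a: str, b: str) -> int:
    """Textbook quadratic-time dynamic programme for the LCS length."""
    prev = [0] * (len(b) + 1)  # LCS lengths of the processed prefix of a vs. prefixes of b
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def semi_local_lcs_naive(a: str, b: str) -> dict:
    """Brute-force table of all semi-local LCS lengths (illustration only).

    Indices are half-open Python slice bounds; this convention is an
    assumption of the sketch, not notation taken from the paper.
    """
    m, n = len(a), len(b)
    return {
        # whole a against every substring b[i:j]
        "string_substring": {(i, j): lcs_length(a, b[i:j])
                             for i in range(n + 1) for j in range(i, n + 1)},
        # every substring a[i:j] against whole b
        "substring_string": {(i, j): lcs_length(a[i:j], b)
                             for i in range(m + 1) for j in range(i, m + 1)},
        # every prefix a[:i] against every suffix b[j:]
        "prefix_suffix": {(i, j): lcs_length(a[:i], b[j:])
                          for i in range(m + 1) for j in range(n + 1)},
        # every suffix a[i:] against every prefix b[:j]
        "suffix_prefix": {(i, j): lcs_length(a[i:], b[:j])
                          for i in range(m + 1) for j in range(n + 1)},
    }
```

The brute-force table calls the quadratic routine once per substring or prefix-suffix pair, so it is far slower than a single global LCS computation. That gap is what makes an all semi-local algorithm matching the global running time notable.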
