Efficient merged longest common subsequence algorithms for similar sequences

Abstract: Given a pair of merging sequences A and B and a target sequence T, the merged longest common subsequence (MLCS) problem is to find the longest common subsequence (LCS) of E(A, B) and T, where E(A, B) is obtained by merging two subsequences of A and B. In this paper, we first propose an algorithm that solves the MLCS problem in O(n|Σ| + (r − L + 1)Lm) time and O(n|Σ| + Lm) space, where r and L denote the lengths of T and of the MLCS, respectively, and m and n denote the shorter and longer lengths of A and B, respectively. The time bound shows that our algorithm is efficient when T and E(A, B) are very similar, because the factor r − L + 1 is then small. With a slight modification, our algorithm also solves a variant of the problem, the block-merged LCS (BMLCS) problem, in O(n|Σ| + (r − L + 1)Lδ) time and O(n|Σ| + Lδ) space, where δ denotes the larger of the numbers of blocks of A and B. Experimental results show that our algorithms are faster than previously published MLCS and BMLCS algorithms for sequences with high similarity. The source code and datasets for the experiments can be found on our web site http://par.cse.nsysu.edu.tw/~mlcs/ [20].
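To make the problem statement concrete, the following is a minimal sketch of the straightforward cubic-time dynamic program for the MLCS length. It is not the algorithm proposed in the paper and does not achieve the bounds stated above; the function name `mlcs_length` and the example strings are illustrative assumptions.

```python
def mlcs_length(t: str, a: str, b: str) -> int:
    """Length of the merged LCS of target t with merging sequences a and b.

    Baseline O(|t| * |a| * |b|) dynamic program (an illustrative sketch,
    not the paper's algorithm). Only two layers over t are kept in memory.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # prev[j][k]: MLCS length of the prefix of t seen so far with a[:j], b[:k].
    prev = [[0] * cols for _ in range(rows)]
    for ch in t:
        cur = [[0] * cols for _ in range(rows)]
        for j in range(rows):
            for k in range(cols):
                best = prev[j][k]  # leave ch unmatched
                if j > 0:
                    best = max(best, cur[j - 1][k])  # drop a[j-1]
                    if a[j - 1] == ch:
                        # match ch with a[j-1]
                        best = max(best, prev[j - 1][k] + 1)
                if k > 0:
                    best = max(best, cur[j][k - 1])  # drop b[k-1]
                    if b[k - 1] == ch:
                        # match ch with b[k-1]
                        best = max(best, prev[j][k - 1] + 1)
                cur[j][k] = best
        prev = cur
    return prev[len(a)][len(b)]


if __name__ == "__main__":
    # "ac" and "bd" can be merged into "abcd", so the whole target is matched.
    print(mlcs_length("abcd", "ac", "bd"))  # 4
```

The table is indexed by a prefix of T, a prefix of A, and a prefix of B. A character of T may be matched with the last character of either merging prefix, which is what lets the matched characters come from an interleaving of subsequences of A and B.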

[1]  Eugene W. Myers, et al. An O(ND) difference algorithm and its variations, 1986, Algorithmica.

[2]  Hsing-Yen Ann, et al. Dynamic programming algorithms for the mosaic longest common subsequence problem, 2007, Inf. Process. Lett.

[3]  T. K. Altheide, et al. Comparing the human and chimpanzee genomes: Searching for needles in a haystack, 2005.

[4]  Esko Ukkonen, et al. Algorithms for Approximate String Matching, 1985, Inf. Control.

[5]  Alberto Apostolico. Remark on the Hsu-Du New Algorithm for the Longest Common Subsequence Problem, 1987, Inf. Process. Lett.

[6]  Frank K. Hwang, et al. An almost-linear time and linear space algorithm for the longest common subsequence problem, 2005, Inf. Process. Lett.

[7]  Alberto Apostolico. Improving the Worst-Case Performance of the Hunt-Szymanski Strategy for the Longest Common Subsequence of Two Strings, 1986, Inf. Process. Lett.

[8]  Hsing-Yen Ann, et al. Efficient algorithms for finding interleaving relationship between sequences, 2008, Inf. Process. Lett.

[9]  L. Bergroth, et al. A survey of longest common subsequence algorithms, 2000, Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000.

[10]  Hsing-Yen Ann, et al. A fast and simple algorithm for computing the longest common subsequence of run-length encoded strings, 2008, Inf. Process. Lett.

[11]  Daniel S. Hirschberg, et al. A linear space algorithm for computing maximal common subsequences, 1975, Commun. ACM.

[12]  Chang-Biau Yang, et al. Efficient Sparse Dynamic Programming for the Merged LCS Problem, 2008, BIOCOMP.

[13]  Timothy B. Stockwell, et al. The Sequence of the Human Genome, 2001, Science.

[14]  Szymon Grabowski. New tabulation and sparse dynamic programming based techniques for sequence similarity problems, 2016, Discret. Appl. Math.

[15]  Nikolaus Augsten, et al. RTED: A Robust Algorithm for the Tree Edit Distance, 2011, Proc. VLDB Endow.

[16]  Hsing-Yen Ann, et al. Efficient Algorithms for the Longest Common Subsequence Problem with Sequential Substring Constraints, 2011, 2011 IEEE 11th International Conference on Bioinformatics and Bioengineering.

[17]  Mike Paterson, et al. A Faster Algorithm Computing String Edit Distances, 1980, J. Comput. Syst. Sci.

[18]  B. Birren, et al. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae, 2004, Nature.

[19]  M. Maes, et al. On a Cyclic String-To-String Correction Problem, 1990, Inf. Process. Lett.

[20]  Mohammad Sohel Rahman, et al. Effective Sparse Dynamic Programming Algorithms for Merged and Block Merged LCS Problems, 2014, J. Comput.

[21]  Thomas G. Szymanski, et al. A fast algorithm for computing longest common subsequences, 1977, Commun. ACM.

[22]  Jean L. Chang, et al. Initial sequence and comparative analysis of the cat genome, 2007, Genome Research.

[23]  Michael J. Fischer, et al. The String-to-String Correction Problem, 1974, J. ACM.

[24]  Sebastian Deorowicz, et al. Bit-Parallel Algorithms for the Merged Longest Common Subsequence Problem, 2013, Int. J. Found. Comput. Sci.

[25]  Maxime Crochemore, et al. A fast and practical bit-vector algorithm for the Longest Common Subsequence problem, 2001, Inf. Process. Lett.

[26]  Richard C. T. Lee, et al. Edit distance for a run-length-encoded string and an uncompressed string, 2007, Inf. Process. Lett.

[27]  Sebastian Deorowicz, et al. Bit-Parallel Algorithm for the Block Variant of the Merged Longest Common Subsequence Problem, 2013, ICMMI.

[28]  Gad M. Landau, et al. Two Algorithms for LCS Consecutive Suffix Alignment, 2004, CPM.

[29]  B. Berger, et al. Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction, 2000.

[30]  Alberto Apostolico, et al. Fast Linear-Space Computations of Longest Common Subsequences, 1992, Theor. Comput. Sci.

[31]  Yahiko Kambayashi, et al. A longest common subsequence algorithm suitable for similar text strings, 1982, Acta Informatica.

[32]  A. Poustka, et al. Timing and mechanism of ancient vertebrate genome duplications -- the adventure of a hypothesis, 2005, Trends in Genetics: TIG.

[33]  Karsten Hokamp, et al. The 2R hypothesis and the human genome sequence, 2004, Journal of Structural and Functional Genomics.

[34]  C. Pandu Rangan, et al. A linear space algorithm for the LCS problem, 2004, Acta Informatica.

[35]  Claus Rick. Simple and fast linear space computation of longest common subsequences, 2000, Inf. Process. Lett.