An improved algorithm for the longest common subsequence problem

The Longest Common Subsequence problem seeks a longest string that is a subsequence of every member of a given set of strings. It has applications in data compression, FPGA circuit minimization, and bioinformatics, among others. The problem is NP-hard for more than two input strings, and existing exact solutions are impractical for large inputs. Several approximation and (meta)heuristic algorithms have therefore been proposed that aim to find good, but not necessarily optimal, solutions. In this paper, we propose a new algorithm based on the constructive beam search method. We devise a novel heuristic, inspired by probability theory, intended for domains where the input strings are assumed to be independent. Special data structures and dynamic programming methods are developed to reduce the time complexity of the algorithm. The proposed algorithm is compared with the state of the art on several standard benchmarks, including random and real biological sequences. Extensive experimental results show that, in most of the experimental cases, the proposed algorithm outperforms the state of the art, giving higher-quality solutions in less computation time.
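To make the constructive beam search idea concrete, the following is a minimal sketch in Python and not the algorithm proposed in the paper. It grows candidate common subsequences one character at a time and keeps only the `beam_width` most promising partial solutions at each level. The names `beam_search_lcs`, `is_subsequence`, and `beam_width` are hypothetical. The ranking function is an assumption as well: it uses the smallest remaining suffix length as a simple upper bound, standing in for the probability-based heuristic, which the abstract does not specify.

```python
from typing import Dict, List, Sequence, Tuple


def is_subsequence(candidate: str, text: str) -> bool:
    """Return True if `candidate` can be obtained from `text` by deleting characters."""
    it = iter(text)
    return all(ch in it for ch in candidate)


def beam_search_lcs(strings: Sequence[str], beam_width: int = 10) -> str:
    """Heuristically build a common subsequence of all `strings`.

    Illustrative only: the scoring rule below is a placeholder upper bound,
    not the heuristic from the paper.
    """
    if not strings or any(len(s) == 0 for s in strings):
        return ""
    # Only characters present in every string can appear in a common subsequence.
    alphabet = sorted(set.intersection(*(set(s) for s in strings)))

    # A state is (partial solution, next unread position in each string).
    beam: List[Tuple[str, Tuple[int, ...]]] = [("", tuple(0 for _ in strings))]
    best = ""
    while beam:
        candidates: Dict[Tuple[int, ...], str] = {}
        for prefix, positions in beam:
            for ch in alphabet:
                next_positions = []
                for s, p in zip(strings, positions):
                    idx = s.find(ch, p)
                    if idx == -1:
                        break  # ch cannot extend this prefix in string s
                    next_positions.append(idx + 1)
                else:
                    # States with equal positions and equal length are equivalent,
                    # so one representative per position tuple is enough.
                    candidates.setdefault(tuple(next_positions), prefix + ch)
        if not candidates:
            break

        def upper_bound(item: Tuple[Tuple[int, ...], str]) -> int:
            # Assumed scoring rule: at most this many characters can still be added.
            return min(len(s) - p for s, p in zip(strings, item[0]))

        ranked = sorted(candidates.items(), key=upper_bound, reverse=True)
        beam = [(prefix, pos) for pos, prefix in ranked[:beam_width]]
        best = beam[0][0]  # every prefix on a level has the same length
    return best


if __name__ == "__main__":
    inputs = ["ABCBDAB", "BDCABA", "BCADBA"]
    result = beam_search_lcs(inputs, beam_width=5)
    print(result, all(is_subsequence(result, s) for s in inputs))
```

With a small `beam_width` the search is fast but may miss an optimal solution. With a beam wide enough to hold every distinct state it degenerates into an exhaustive search, which is why the quality of the ranking heuristic matters in practice.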