Deposition and extension approach to find longest common subsequence for thousands of long sequences

Finding the longest common subsequence (LCS) of an arbitrary number of sequences is an interesting and challenging problem in computer science. The problem is NP-complete, but because of its importance, many heuristic algorithms have been proposed, such as Long Run, Expansion Algorithm and THSB. However, the performance of many current heuristic algorithms, in either result quality or processing time, deteriorates quickly as the number of sequences and the sequence length increase. In this paper, we propose a post-processing heuristic algorithm for the LCS problem, the Deposition and Extension Algorithm (DEA). The algorithm first generates a common subsequence by "sequence deposition", based on fine-tuning of the search range, and then extends this common subsequence. We prove that the algorithm generates common subsequences (CSs) with guaranteed lengths. Experiments on different datasets showed that the results of DEA were better than those of Long Run and Expansion Algorithm, especially on inputs with many long sequences. The algorithm also had superior efficiency in both time and memory.
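To make the "generate, then extend" idea concrete, the following is a minimal sketch of a generic greedy extension step, not the DEA procedure itself, whose details are not given in this abstract. It assumes a non-empty list of plain strings, and the function names (`is_subsequence`, `extend_once`, `greedy_extend`) are hypothetical. Starting from any common subsequence, it repeatedly inserts a single character wherever the result is still a subsequence of every input.

```python
from typing import List, Optional


def is_subsequence(candidate: str, sequence: str) -> bool:
    """Return True if candidate can be obtained from sequence by deleting characters."""
    it = iter(sequence)
    # `ch in it` advances the iterator, so matches must occur in order.
    return all(ch in it for ch in candidate)


def is_common_subsequence(candidate: str, sequences: List[str]) -> bool:
    """Return True if candidate is a subsequence of every input sequence."""
    return all(is_subsequence(candidate, s) for s in sequences)


def extend_once(candidate: str, sequences: List[str], alphabet: List[str]) -> Optional[str]:
    """Try every single-character insertion; return the first valid one, else None."""
    for pos in range(len(candidate) + 1):
        for ch in alphabet:
            trial = candidate[:pos] + ch + candidate[pos:]
            if is_common_subsequence(trial, sequences):
                return trial
    return None


def greedy_extend(candidate: str, sequences: List[str]) -> str:
    """Illustrative greedy extension (an assumption, not the paper's method)."""
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = sorted(set.intersection(*(set(s) for s in sequences)))
    while True:
        longer = extend_once(candidate, sequences, alphabet)
        if longer is None:
            return candidate
        candidate = longer


if __name__ == "__main__":
    seqs = ["ACGTAC", "AGTTAC", "CAGTAC"]
    print(greedy_extend("AT", seqs))
```

The result is maximal only with respect to single-character insertions, so it is a common subsequence but not necessarily a longest one. The brute-force check also rescans every sequence for each trial, which would be far too slow for thousands of long sequences; it is meant only to illustrate the extension step.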
