A hyper-heuristic for the Longest Common Subsequence problem

The Longest Common Subsequence Problem is the problem of finding a longest string that is a subsequence of every member of a given set of strings. It has applications in FPGA circuit minimization, data compression, and bioinformatics, among others. The problem is NP-hard in its general form, which implies that no exact polynomial-time algorithm is known for it, and none exists unless P = NP. Consequently, inexact algorithms have been proposed to obtain good, but not necessarily optimal, solutions in reasonable time. In this paper, a hyper-heuristic algorithm incorporated within a constructive beam search is proposed for the problem. The proposed hyper-heuristic is based on two basic heuristic functions, one of which is introduced in this paper, and determines dynamically which one to use for a given problem instance. The proposed algorithm is compared with state-of-the-art algorithms on simulated and real biological sequences. Extensive experiments reveal that the proposed hyper-heuristic is superior to the state-of-the-art methods with respect to both solution quality and running time.
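To make the general idea concrete, the following is a minimal Python sketch of a constructive beam search for the problem, with a selector that picks between two heuristic functions per instance. It is an illustration under stated assumptions, not the algorithm of the paper: the two scoring functions (`min_remaining`, `total_remaining`), the selection rule (a short trial run with a narrow beam), and all names and parameter values are hypothetical, since the abstract does not specify them.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Positions = Tuple[int, ...]  # next unread index in each input string
Heuristic = Callable[[Sequence[str], Positions], float]


def is_subsequence(candidate: str, text: str) -> bool:
    """Return True if candidate can be obtained from text by deleting symbols."""
    it = iter(text)
    return all(ch in it for ch in candidate)


def extensions(strings: Sequence[str], positions: Positions) -> Iterator[Tuple[str, Positions]]:
    """Yield each symbol that still occurs in every remaining suffix,
    together with the positions reached after consuming its first occurrence."""
    for symbol in sorted(set(strings[0][positions[0]:])):
        new_positions: List[int] = []
        for s, p in zip(strings, positions):
            idx = s.find(symbol, p)
            if idx < 0:
                break
            new_positions.append(idx + 1)
        else:
            yield symbol, tuple(new_positions)


# Two illustrative scoring functions (assumptions, not the paper's heuristics).
def min_remaining(strings: Sequence[str], positions: Positions) -> float:
    """Length of the shortest remaining suffix (an upper bound on further growth)."""
    return min(len(s) - p for s, p in zip(strings, positions))


def total_remaining(strings: Sequence[str], positions: Positions) -> float:
    """Total number of unread symbols across all strings."""
    return sum(len(s) - p for s, p in zip(strings, positions))


def beam_search_lcs(strings: Sequence[str], heuristic: Heuristic, beam_width: int = 10) -> str:
    """Grow common subsequences one symbol at a time, keeping the best
    beam_width partial solutions at each level according to heuristic."""
    beam = [("", tuple(0 for _ in strings))]
    best = ""
    while beam:
        candidates = [
            (prefix + symbol, new_positions)
            for prefix, positions in beam
            for symbol, new_positions in extensions(strings, positions)
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: heuristic(strings, c[1]), reverse=True)
        beam = candidates[:beam_width]
        best = beam[0][0]  # all candidates on one level have equal length
    return best


def hyper_heuristic_lcs(strings: Sequence[str], heuristics: Sequence[Heuristic],
                        trial_width: int = 2, beam_width: int = 10) -> str:
    """Hypothetical selector: try each heuristic with a narrow beam, then
    rerun the one that produced the longest trial result with the full beam."""
    chosen = max(heuristics, key=lambda h: len(beam_search_lcs(strings, h, trial_width)))
    return beam_search_lcs(strings, chosen, beam_width)


if __name__ == "__main__":
    sequences = ["ACGTAC", "CGTTAC", "GACGTC"]
    result = hyper_heuristic_lcs(sequences, [min_remaining, total_remaining])
    print(result, all(is_subsequence(result, s) for s in sequences))
```

Every string returned by the sketch is a common subsequence of the inputs by construction, but it is not guaranteed to be a longest one; widening the beam trades running time for solution quality.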
