Polynomial-Time Approximation Algorithms for Weighted LCS Problem

We study a variant of the well-known Longest Common Subsequence (LCS) problem for weighted sequences. A (biological) weighted sequence specifies, for each position of the sequence, the probability of each symbol occurring there (such sequences are also called Position Weighted Matrices, PWM). Two versions of the problem were proposed by Amir et al. (2009 and 2010): LCWS and LCWS2 (Longest Common Weighted Subsequence 1 and 2). We solve an open problem, stated in the conclusions of the paper by Amir et al., concerning the tractability of a log-probability version of the LCWS2 problem for bounded alphabets: we show that it is NP-hard already for an alphabet of size 2. We also improve the (1/|Σ|)-approximation algorithm given by Amir et al. (where Σ is the alphabet) by giving a polynomial-time approximation scheme (PTAS) for the LCWS2 problem that uses O(n^5) space. We also give a simpler (1/2)-approximation algorithm for the same problem that uses only O(n^2) space.
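To make the notion of a weighted sequence concrete, the following minimal sketch stores one symbol distribution per position and computes the probability of reading a given string at chosen positions. The example data, the name `occurrence_probability`, and the convention that this probability is the product of the per-position probabilities are assumptions made for illustration; the abstract itself does not spell out these details, and this is not the algorithm of the paper.

```python
from math import prod

# Hypothetical weighted sequence over the alphabet {a, b}:
# one dict per position, mapping each symbol to its probability.
X = [
    {"a": 0.5, "b": 0.5},
    {"a": 0.25, "b": 0.75},
    {"a": 1.0, "b": 0.0},
]


def occurrence_probability(weighted_seq, positions, s):
    """Probability of reading string s at the given positions.

    Assumption: positions are 0-based and strictly increasing, and the
    probability is the product of the per-position symbol probabilities.
    """
    if len(positions) != len(s):
        raise ValueError("need exactly one position per symbol")
    if any(p >= q for p, q in zip(positions, positions[1:])):
        raise ValueError("positions must be strictly increasing")
    # A symbol missing from a position's dict is treated as probability 0.
    return prod(weighted_seq[i].get(c, 0.0) for i, c in zip(positions, s))


# Reading "ba" at positions 0 and 2 multiplies 0.5 by 1.0.
p = occurrence_probability(X, [0, 2], "ba")
```

Under this convention, taking logarithms turns the product into a sum, which is one way to read the "log-probability version" mentioned above.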

[1] Maxime Crochemore, et al. An Optimal Algorithm for Computing the Repetitions in a Word, 1981, Inf. Process. Lett.

[2] J. Thompson, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, 1994, Nucleic acids research.

[3] Tsvi Kopelowitz, et al. Property matching and weighted matching, 2006, Theor. Comput. Sci.

[4] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[5] L. Bergroth, et al. A survey of longest common subsequence algorithms, 2000, Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000.

[6] Costas S. Iliopoulos, et al. The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications, 2006, Fundam. Informaticae.

[7] Costas S. Iliopoulos, et al. Parallel Algorithms for Degenerate and Weighted Sequences Derived from High Throughput Sequencing Technologies, 2009, Stringology.

[8] Costas S. Iliopoulos, et al. An Algorithmic Framework for Motif Discovery Problems in Weighted Sequences, 2010, CIAC.

[9] Costas S. Iliopoulos, et al. Varieties of Regularities in Weighted Sequences, 2010, AAIM.

[10] Costas S. Iliopoulos, et al. Approximate Matching in Weighted Sequences, 2006, CPM.

[11] Wojciech Rytter, et al. Jewels of stringology, 2002.

[12] Costas S. Iliopoulos, et al. Motif Extraction from Weighted Sequences, 2004, SPIRE.

[13] Costas S. Iliopoulos, et al. Algorithms for mapping short degenerate and weighted sequences to a reference genome, 2009, Int. J. Comput. Biol. Drug Des.

[14] Costas S. Iliopoulos, et al. Computing the Repetitions in a Biological Weighted Sequence, 2005, J. Autom. Lang. Comb.

[15] Costas S. Iliopoulos, et al. Computation of Repetitions and Regularities of Biologically Weighted Sequences, 2006, J. Comput. Biol.

[16] Costas S. Iliopoulos, et al. String Matching with Swaps in a Weighted Sequence, 2004, CIS.

[17] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[18] Jing Fan, et al. Loose and strict repeats in weighted sequences of proteins, 2010, Protein and peptide letters.

[19] Timothy B. Stockwell, et al. The Sequence of the Human Genome, 2001, Science.

[20] Amihood Amir, et al. Weighted LCS, 2009, J. Discrete Algorithms.