Polynomial-time approximation algorithms for weighted LCS problem

We consider a variant of the well-known Longest Common Subsequence (LCS) problem for weighted sequences. A weighted sequence specifies, for each position, the probability of each symbol occurring at that position (such sequences are also called Position Weighted Matrices, PWM). Two versions of the problem were proposed by Amir et al. (2009, 2010): LCWS and LCWS2 (Longest Common Weighted Subsequence 1 and 2). We resolve an open problem stated by Amir et al. concerning the tractability of the log-probability version of the LCWS2 problem for bounded alphabets: we show that it is NP-hard already for an alphabet of size 2. We also improve the $(1/|\Sigma|)$-approximation algorithm given by Amir et al. (where $\Sigma$ is the alphabet) by giving a polynomial-time approximation scheme (PTAS) for the LCWS2 problem that uses $O(n^5)$ space. Finally, we give a simpler $(1/2)$-approximation algorithm for the same problem that uses only $O(n^2)$ space.
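To make the notion of a weighted sequence concrete, the following minimal Python sketch represents one as a list of per-position symbol distributions and computes the probability that a string occurs at a chosen increasing list of positions. It assumes the usual convention that this probability is the product of the per-position symbol probabilities, with the log-probability version being the corresponding sum of logarithms. The names `X`, `occurrence_probability`, and `log_occurrence_probability` are hypothetical and serve only as illustration; this is not the algorithm of the paper.

```python
import math

# Hypothetical weighted sequence over the alphabet {a, b}:
# one dict per position, mapping symbol -> probability at that position.
X = [
    {"a": 0.5, "b": 0.5},
    {"a": 1.0, "b": 0.0},
    {"a": 0.25, "b": 0.75},
]


def occurrence_probability(wseq, positions, s):
    """Probability that string s occurs in wseq at the given positions.

    Assumption: positions are 0-based and strictly increasing, and the
    probability is the product of the per-position symbol probabilities.
    """
    if len(positions) != len(s):
        raise ValueError("need exactly one position per symbol")
    if any(p >= q for p, q in zip(positions, positions[1:])):
        raise ValueError("positions must be strictly increasing")
    prob = 1.0
    for pos, ch in zip(positions, s):
        prob *= wseq[pos].get(ch, 0.0)
    return prob


def log_occurrence_probability(wseq, positions, s):
    """Log-probability variant: a sum of logs, or -inf if impossible."""
    prob = occurrence_probability(wseq, positions, s)
    return math.log(prob) if prob > 0.0 else float("-inf")


# "ab" read at positions 0 and 2 has probability 0.5 * 0.75.
print(occurrence_probability(X, [0, 2], "ab"))
```

Under this convention, a threshold on a product of probabilities becomes a threshold on a sum of log-probabilities, which is the form in which the hardness question for bounded alphabets is posed.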

[1] Solon P. Pissis, et al. Optimal Computation of all Repetitions in a Weighted String, 2014, ICABD.

[2] J. Gern. The Sequence of the Human Genome, 2001, Science.

[3] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[4] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.

[5] Costas S. Iliopoulos, et al. Computing the Repetitions in a Biological Weighted Sequence, 2005, J. Autom. Lang. Comb.

[6] Maxime Crochemore, et al. An Optimal Algorithm for Computing the Repetitions in a Word, 1981, Inf. Process. Lett.

[7] J. Thompson, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, 1994, Nucleic acids research.

[8] Jing Fan, et al. Loose and strict repeats in weighted sequences of proteins, 2010, Protein and peptide letters.

[9] Timothy B. Stockwell, et al. The Sequence of the Human Genome, 2001, Science.

[10] Costas S. Iliopoulos, et al. Approximate Matching in Weighted Sequences, 2006, CPM.

[11] Costas S. Iliopoulos, et al. Varieties of Regularities in Weighted Sequences, 2010, AAIM.

[12] Costas S. Iliopoulos, et al. String Matching with Swaps in a Weighted Sequence, 2004, CIS.

[13] Costas S. Iliopoulos, et al. Parallel Algorithms for Degenerate and Weighted Sequences Derived from High Throughput Sequencing Technologies, 2009, Stringology.

[14] Amihood Amir, et al. Weighted LCS, 2009, IWOCA.

[15] Costas S. Iliopoulos, et al. An Algorithmic Framework for Motif Discovery Problems in Weighted Sequences, 2010, CIAC.

[16] Wojciech Rytter, et al. Jewels of stringology, 2002.

[17] Eugene W. Myers, et al. A whole-genome assembly of Drosophila, 2000, Science.

[18] Tsvi Kopelowitz, et al. Property matching and weighted matching, 2006, Theor. Comput. Sci.

[19] Costas S. Iliopoulos, et al. Locating tandem repeats in weighted sequences in proteins, 2013, BMC Bioinformatics.

[20] Costas S. Iliopoulos, et al. Computation of Repetitions and Regularities of Biologically Weighted Sequences, 2006, J. Comput. Biol.

[21] Costas S. Iliopoulos, et al. The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications, 2006, Fundam. Informaticae.

[22] D. Mccormick. Sequence the Human Genome, 1986, Bio/Technology.

[23] Costas S. Iliopoulos, et al. Motif Extraction from Weighted Sequences, 2004, SPIRE.

[24] Costas S. Iliopoulos, et al. Algorithms for mapping short degenerate and weighted sequences to a reference genome, 2009, Int. J. Comput. Biol. Drug Des.

[25] Xiangqun H. Zheng, et al. A Whole-Genome Assembly of Drosophila, 2000.

[26] L. Bergroth, et al. A survey of longest common subsequence algorithms, 2000, Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000.