Hardness Results on Local Multiple Alignment of Biological Sequences

This paper studies the local multiple alignment problem: given a set of protein or DNA sequences, locate a region (i.e., a substring) of fixed length in each sequence so that the score determined by the set of regions is optimized. We consider the following scoring schemes: the relative entropy score (i.e., average information content), the sum-of-pairs score, and a relative entropy-like score introduced by Li et al. We prove that local multiple alignment is NP-hard under each of these scoring schemes. Moreover, we prove that local multiple alignment is APX-hard under relative entropy scoring. This implies that, unless P = NP, the problem has no polynomial time approximation scheme, that is, no polynomial time algorithm whose worst-case approximation error can be specified arbitrarily. Several related theoretical results are also provided.
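
To make the problem statement concrete, the following is a minimal sketch of relative entropy scoring together with an exhaustive search over start positions. It is not taken from the paper. The function names, the uniform background distribution, and the use of base-2 logarithms are assumptions made for illustration, and the score is taken to be the sum over columns of f_a * log2(f_a / p_a), where f_a is the frequency of letter a in the column and p_a is its background probability.

```python
import math
from collections import Counter
from itertools import product


def relative_entropy_score(regions, background):
    """Sum over columns of sum_a f_a * log2(f_a / p_a).

    regions: equal-length substrings, one per sequence.
    background: dict mapping each letter to its background probability
    (an assumed input; the abstract does not fix a particular background).
    """
    width = len(regions[0])
    n = len(regions)
    score = 0.0
    for j in range(width):
        counts = Counter(region[j] for region in regions)
        for letter, count in counts.items():
            f = count / n
            score += f * math.log2(f / background[letter])
    return score


def best_local_alignment(sequences, width, background):
    """Try every combination of start positions (hypothetical helper).

    The number of combinations grows exponentially with the number of
    sequences, so this is only usable on toy inputs.
    """
    start_ranges = [range(len(s) - width + 1) for s in sequences]
    best_score, best_starts = float("-inf"), None
    for starts in product(*start_ranges):
        regions = [s[i:i + width] for s, i in zip(sequences, starts)]
        score = relative_entropy_score(regions, background)
        if score > best_score:
            best_score, best_starts = score, starts
    return best_score, best_starts


# Toy example: "ACGT" occurs in every sequence.
uniform = {letter: 0.25 for letter in "ACGT"}
toy_sequences = ["ACGTAC", "TTACGT", "GACGTT"]
toy_score, toy_starts = best_local_alignment(toy_sequences, 4, uniform)
```

In the toy example the search selects the shared substring "ACGT" from each sequence. The exponential cost of this exhaustive approach is consistent with, though of course no proof of, the hardness results stated above.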
