Aligning alignments exactly

A basic computational problem that arises in both the construction and local-search phases of the best heuristics for multiple sequence alignment is that of aligning the columns of two multiple alignments. When the scoring function is the sum-of-pairs objective and induced pairwise alignments are evaluated using linear gap-costs, we call this problem Aligning Alignments. While this problem is seemingly a straightforward extension of two-sequence alignment, we prove it is actually NP-complete. As explained in the paper, this provides the first demonstration that minimizing linear gap-costs, in the context of multiple sequence alignment, is inherently hard.

We also develop an exact algorithm for Aligning Alignments that is remarkably efficient in practice, both in time and space. Even though the problem is NP-complete, computational experiments on both biological and simulated data show we can compute optimal alignments for all benchmark instances in two standard datasets, and solve very large random instances with highly gapped sequences.
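To make the objective concrete, the following is a minimal sketch of how a sum-of-pairs score with gap costs on induced pairwise alignments could be evaluated for a given multiple alignment. It is an illustration only, not the exact algorithm from the paper. It assumes that "linear gap-costs" means a gap of length k costs gap_open + gap_extend * k; the function names, the unit mismatch cost, and the default parameter values are all hypothetical.

```python
from itertools import combinations

GAP = "-"

def induced_pair_cost(a, b, mismatch=1, gap_open=2, gap_extend=1):
    """Cost of the pairwise alignment induced by two rows of a multiple alignment.

    Columns where both rows hold a gap vanish in the induced alignment,
    so they are skipped and do not interrupt a run of gaps.
    Parameter values are illustrative assumptions.
    """
    cost = 0
    state = None  # one of "sub", "gap_a", "gap_b"
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # column disappears in the induced alignment
        if x == GAP:
            new_state = "gap_a"
        elif y == GAP:
            new_state = "gap_b"
        else:
            new_state = "sub"
        if new_state == "sub":
            cost += 0 if x == y else mismatch
        else:
            if new_state != state:
                cost += gap_open  # a new gap starts here
            cost += gap_extend    # every gapped position pays the per-position cost
        state = new_state
    return cost

def sum_of_pairs_cost(rows, **params):
    """Sum the induced pairwise costs over all pairs of rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(induced_pair_cost(a, b, **params) for a, b in combinations(rows, 2))

if __name__ == "__main__":
    alignment = ["AC-GT", "A--GT", "ACCGT"]
    print(sum_of_pairs_cost(alignment))
```

Note that the number of gaps in an induced pairwise alignment depends on which columns vanish for that particular pair, so the gap-opening charges cannot be read off one column at a time. Aligning two alignments means choosing how to interleave their columns so that this total is minimized.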
