Near optimal multiple sequence alignments using a traveling salesman problem approach

We present a new method for computing multiple sequence alignments (MSAs). The input is a set of n protein sequences. We assume that the sequences are related to each other and that some unknown evolutionary tree corresponds to the MSA. One advantage of our method is that the alignment can be scored with reference to this phylogenetic tree, even though the tree structure itself may remain unknown. Instead of computing an evolutionary tree, we only need to compute a circular tour of the tree, which is determined by a traveling salesman problem (TSP) algorithm. Our algorithm calculates a near-optimal MSA with a performance guarantee of (n-1)/n · opt, where opt is the optimal score of the MSA. The algorithm runs in O(k^2 n^2) time, where k is the length of the longest input sequence. Starting from this alignment, we then improve it further. Experimental results are shown at the end.
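
To make the circular-tour idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy scoring scheme (match +1, mismatch -1, linear gap -1) in place of a real protein substitution matrix, and it replaces a proper TSP solver with a brute-force search that is only feasible for a handful of sequences. The names pairwise_score and best_circular_tour are hypothetical. Computing all pairwise scores by dynamic programming costs O(k^2) per pair and O(k^2 n^2) overall in this sketch.

```python
from itertools import permutations


def pairwise_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (Needleman-Wunsch, linear gaps).

    The scoring values are illustrative assumptions, not the paper's.
    Runs in O(len(a) * len(b)) time and O(len(b)) space.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_circular_tour(seqs):
    """Return (order, total) for the circular tour with maximum summed score.

    Brute force over all orders starting at sequence 0; a stand-in for a
    real TSP algorithm and only usable for small n.
    """
    n = len(seqs)
    score = [
        [pairwise_score(seqs[i], seqs[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]
    best_order, best_total = None, float("-inf")
    for rest in permutations(range(1, n)):
        order = (0,) + rest
        # Sum scores of neighbours along the tour, closing the cycle at the end.
        total = sum(score[order[i]][order[(i + 1) % n]] for i in range(n))
        if total > best_total:
            best_order, best_total = order, total
    return best_order, best_total


if __name__ == "__main__":
    sequences = ["AAAA", "AAAT", "TTTT", "TTTA"]
    order, total = best_circular_tour(sequences)
    print("tour:", [sequences[i] for i in order], "score:", total)
```

In this sketch the tour places similar sequences next to each other, which is the role the circular tour plays as a substitute for the unknown tree. A practical version would use a substitution matrix with affine gap costs and a dedicated TSP heuristic or exact solver.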
