A Large Version of the Small Parsimony Problem

Given a multiple alignment of k sequences, an evolutionary tree relating the sequences, and a subadditive gap penalty function (e.g. an affine function), we reconstruct the internal nodes of the tree optimally: we find the optimal explanation of the observed gaps in terms of indels, and the most parsimonious assignment of nucleotides. The gaps of the alignment are represented in a so-called gap graph, which is reduced by theoretically sound preprocessing. As a result, the running time is far better than the exponential worst case in all but the most pathological examples. For example, for a tree with nine leaves and a random alignment of length 10,000 with 60% gaps, the running time is around 45 seconds on average. For a real alignment of length 9868 of nine HIV-1 sequences, the running time is less than one second.
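
The two ingredients named above can be illustrated with a minimal Python sketch. It is not the gap-graph method of the paper: it only shows an affine gap penalty, which is subadditive when the opening cost is non-negative, and the classical Fitch bottom-up pass for the small parsimony problem on a single gap-free alignment column. The function names, the nested-tuple tree encoding, and the cost values are assumptions chosen for illustration.

```python
from typing import Dict, Set, Tuple, Union

# Assumed encoding: a tree is a leaf name (str) or a pair (left, right) of subtrees.
Tree = Union[str, tuple]


def affine_gap(length: int, open_cost: float = 3.0, extend_cost: float = 1.0) -> float:
    """Affine penalty g(n) = open_cost + extend_cost * n for a gap of length n >= 1.

    With open_cost >= 0 this is subadditive: g(a + b) <= g(a) + g(b).
    The default costs are illustrative values, not taken from the paper.
    """
    return open_cost + extend_cost * length


def fitch_cost(tree: Tree, column: Dict[str, str]) -> Tuple[Set[str], int]:
    """Fitch's bottom-up pass for one alignment column without gaps.

    Returns the candidate state set at the root of `tree` and the minimum
    number of substitutions needed to explain the leaf states in `column`.
    """
    if isinstance(tree, str):
        # Leaf: the observed nucleotide is the only candidate, at no cost.
        return {column[tree]}, 0
    left_set, left_cost = fitch_cost(tree[0], column)
    right_set, right_cost = fitch_cost(tree[1], column)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra substitution.
        return common, left_cost + right_cost
    # The children disagree: take the union and pay one substitution.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical five-leaf tree and one alignment column.
tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
column = {"s1": "A", "s2": "C", "s3": "A", "s4": "G", "s5": "G"}
root_states, substitutions = fitch_cost(tree, column)
print(root_states, substitutions)
```

For this column the pass reports two substitutions, with A and G as the candidate root states. Handling gaps optimally requires the indel reconstruction described in the abstract, which this sketch does not attempt.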
