New Algorithms for Multiple DNA Sequence Alignment

We present a mathematical framework for anchoring in global multiple alignment. Our framework uses anchors that are hits to spaced seeds and identifies them progressively, guided by a phylogenetic tree. We compute anchors in two passes over the tree: from the root down to the leaves, and from the leaves up. In both cases, we compute thresholds for anchors to minimize errors. One innovative aspect of our approach is the approximate inference of ancestral sequences with accommodation for ambiguity. Combined with proper scoring techniques and seeding, this lets us pick many anchors in homologous positions as we align up a phylogenetic tree, minimizing total work. In simulations, our algorithm is reasonably successful: its accuracy is comparable to that of existing software, and it is substantially more efficient.
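To make the notion of a spaced-seed hit concrete, the following is a minimal Python sketch for two plain DNA strings. It is not the paper's algorithm: the function name `spaced_seed_hits` and the seed pattern `"1101"` are assumptions chosen for illustration, and the sketch leaves out anchor thresholds, the phylogenetic tree, and ambiguous ancestral sequences. In the seed, `1` marks a position that must match and `0` marks a position that is ignored.

```python
from collections import defaultdict


def spaced_seed_hits(a, b, seed="1101"):
    """Return sorted (i, j) pairs where a[i:] and b[j:] agree at every
    '1' position of the seed. The seed pattern is illustrative only."""
    care = [k for k, c in enumerate(seed) if c == "1"]  # positions that must match
    span = len(seed)

    # Index every window of `a` by the letters at the "care" positions.
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        key = "".join(a[i + k] for k in care)
        index[key].append(i)

    # Look up every window of `b` in the index; each match is a seed hit.
    hits = []
    for j in range(len(b) - span + 1):
        key = "".join(b[j + k] for k in care)
        for i in index.get(key, []):
            hits.append((i, j))
    return sorted(hits)


if __name__ == "__main__":
    print(spaced_seed_hits("ACGTACGT", "ACTTACGA", seed="1101"))
```

Each hit is only a candidate anchor. Repeated sequence produces hits at non-homologous positions, which is one reason a framework such as the one described here needs thresholds before accepting a hit as an anchor.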
