An Eulerian Path Approach to Global Multiple Alignment for DNA Sequences

With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.

[1]  Burkhard Morgenstern,et al.  DIALIGN2: Improvement of the segment to segment approach to multiple sequence alignment , 1999, German Conference on Bioinformatics.

[2]  D. Higgins,et al.  SAGA: sequence alignment by genetic algorithm. , 1996, Nucleic acids research.

[3]  M S Waterman,et al.  DNA sequence assembly and multiple sequence alignment by an Eulerian path approach. , 2003, Cold Spring Harbor symposia on quantitative biology.

[4]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[5]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[6]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[7]  W. Taylor A flexible method to align large numbers of biological sequences , 2005, Journal of Molecular Evolution.

[8]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.

[9]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[10]  P. Argos,et al.  Motif recognition and alignment for many sequences by comparison of dot-matrices. , 1991, Journal of molecular biology.

[11]  Michael S. Waterman,et al.  A New Algorithm for DNA Sequence Assembly , 1995, J. Comput. Biol..

[12]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[13]  M S Waterman,et al.  Consensus methods for DNA and protein sequence alignment. , 1990, Methods in enzymology.

[14]  H. M. Martinez A flexible multiple sequence alignment program. , 1988, Nucleic acids research.

[15]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[16]  J. Thompson,et al.  The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. , 1997, Nucleic acids research.

[17]  M. Sternberg,et al.  A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. , 1987, Journal of molecular biology.

[18]  M. Waterman,et al.  Line geometries for sequence comparisons , 1984 .

[19]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[20]  John D. Kececioglu,et al.  The Maximum Weight Trace Problem in Multiple Sequence Alignment , 1993, CPM.

[21]  D. Sankoff Minimal Mutation Trees of Sequences , 1975 .

[22]  P. Green,et al.  Base-calling of automated sequencer traces using phred. I. Accuracy assessment. , 1998, Genome research.

[23]  Smith Rf,et al.  Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. , 1992 .

[24]  Michael S. Waterman,et al.  Introduction to computational biology , 1995 .

[25]  Desmond G. Higgins,et al.  Fast and sensitive multiple sequence alignments on a microcomputer , 1989, Comput. Appl. Biosci..

[26]  A. Dress,et al.  Multiple DNA and protein sequence alignment based on segment-to-segment comparison. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[27]  R. F. Smith,et al.  Automatic generation of primary sequence patterns from sets of related protein sequences. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[28]  W. A. Beyer,et al.  Some Biological Sequence Metrics , 1976 .

[29]  David Sankoff,et al.  Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA , 1976, Journal of Molecular Evolution.

[30]  S F Altschul,et al.  Weights for data related by a tree. , 1989, Journal of molecular biology.

[31]  S. Altschul,et al.  A tool for multiple sequence alignment. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[32]  O. Gotoh,et al.  Multiple sequence alignment: algorithms and applications. , 1999, Advances in biophysics.

[33]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.

[34]  Martin Vingron,et al.  Alignment Networks and Electrical Networks , 1996, Discret. Appl. Math..

[35]  L. Gordon,et al.  Tutorial on large deviations for the binomial distribution. , 1989, Bulletin of mathematical biology.

[36]  P Green,et al.  Base-calling of automated sequencer traces using phred. II. Error probabilities. , 1998, Genome research.

[37]  M. A. McClure,et al.  Comparative analysis of multiple protein-sequence alignment methods. , 1994, Molecular biology and evolution.

[38]  G J Barton,et al.  Evaluation and improvements in the automatic alignment of protein sequences. , 1987, Protein engineering.

[39]  Liisa Holm,et al.  COFFEE: an objective function for multiple sequence alignments , 1998, Bioinform..