DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment

Background: DIALIGN-T is a reimplementation of the multiple-alignment program DIALIGN. Due to several algorithmic improvements, it produces significantly better alignments on locally and globally related sequence sets than previous versions of DIALIGN. However, like the original implementation of the program, DIALIGN-T uses a straightforward greedy approach to assemble multiple alignments from local pairwise sequence similarities. Such greedy approaches may be vulnerable to spurious random similarities and can therefore lead to suboptimal results. In this paper, we present DIALIGN-TX, a substantial improvement of DIALIGN-T that combines our previous greedy algorithm with a progressive alignment approach.

Results: Our new heuristic produces significantly better alignments, especially on globally related sequences, without excessively increasing CPU time or memory consumption. The new method is based on a guide tree; to detect possible spurious sequence similarities, it employs a vertex-cover approximation on a conflict graph. We performed benchmarking tests on a large set of nucleic acid and protein sequences. For protein benchmarks, we used the benchmark database BALIBASE 3 and an updated release of the database IRMBASE 2 to assess alignment quality on globally and locally related sequences, respectively. For alignment of nucleic acid sequences, we used BRAliBase II for global alignment and a newly developed database of locally related sequences called DIRMBASE 1. IRMBASE 2 and DIRMBASE 1 are constructed by implanting highly conserved motifs at random positions in long unalignable sequences.

Conclusion: On BALIBASE 3, our new program performs significantly better than the previous program DIALIGN-T and outperforms the popular global aligner CLUSTAL W, though it is still outperformed by programs that focus on global alignment, such as MAFFT, MUSCLE and T-COFFEE. On the locally related test sets in IRMBASE 2 and DIRMBASE 1, our method outperforms all other programs, and MAFFT E-INSi is the only method that comes close to the performance of DIALIGN-TX.
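
The abstract mentions a vertex-cover approximation on a conflict graph without giving details. As a rough illustration only, the sketch below shows one common greedy heuristic for weighted vertex cover: repeatedly pick the vertex with the lowest weight per uncovered edge and charge that ratio to its neighbours. It assumes, hypothetically, that vertices are fragments with a score as weight and that an edge joins two fragments that cannot both appear in one alignment. The function name `greedy_vertex_cover`, the fragment labels, and the weights are invented for this example and are not taken from DIALIGN-TX.

```python
def greedy_vertex_cover(weights, edges):
    """Greedy approximation of a minimum-weight vertex cover.

    weights: dict mapping vertex -> non-negative weight (assumed fragment score)
    edges:   iterable of (u, v) pairs (assumed pairwise conflicts)
    Returns a set of vertices that touches every edge.
    """
    # Adjacency sets holding only the edges that are still uncovered.
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    residual = dict(weights)  # weight not yet "paid for" by earlier picks
    cover = set()

    while any(adj[v] for v in adj):
        # Vertex with the smallest residual weight per uncovered edge;
        # the vertex label breaks ties so the result is deterministic.
        v = min((x for x in adj if adj[x]),
                key=lambda x: (residual[x] / len(adj[x]), x))
        ratio = residual[v] / len(adj[v])
        for u in list(adj[v]):
            residual[u] -= ratio   # charge the ratio to each neighbour
            adj[u].discard(v)      # edge (u, v) is now covered
        adj[v].clear()
        cover.add(v)
    return cover


# Hypothetical example: a low-scoring fragment f2 conflicts with two
# high-scoring fragments, so removing f2 resolves both conflicts.
weights = {"f1": 10.0, "f2": 1.0, "f3": 8.0}
conflicts = [("f1", "f2"), ("f2", "f3")]
print(greedy_vertex_cover(weights, conflicts))  # {'f2'}
```

Under this reading, the vertices in the returned cover are the candidates for spurious similarities: discarding them leaves a set of fragments with no remaining conflicts. Whether DIALIGN-TX weights or selects vertices this way is not stated in the abstract.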
