Assessing the efficiency of multiple sequence alignment programs

BackgroundMultiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program’s algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution.ResultsOur results indicate that mostly the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, being CLUSTALW the least RAM memory demanding program.ConclusionsBased on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are well recommended for better accuracy of multiple sequence alignments. T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are specially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by more programs in a near future, which will certainly improve performance significantly.

[1]  Kazutaka Katoh,et al.  Parallelization of the MAFFT multiple sequence alignment program , 2010, Bioinform..

[2]  Desmond G. Higgins,et al.  Analysis and Comparison of Benchmarks for Multiple Sequence Alignment , 2006, Silico Biol..

[3]  Olivier Poch,et al.  BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark , 2005, Proteins.

[4]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.

[5]  Robert C. Edgar,et al.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity , 2004, BMC Bioinformatics.

[6]  Carito Guziolowski,et al.  Algorithms for Molecular Biology , 2007 .

[7]  D. Higgins,et al.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega , 2011, Molecular systems biology.

[8]  M. Waterman,et al.  Comparative biosequence metrics , 2005, Journal of Molecular Evolution.

[9]  Dennis R. Livesay,et al.  Probalign: multiple sequence alignment using partition function posterior probabilities , 2006, Bioinform..

[10]  Olivier Poch,et al.  A new protein linear motif benchmark for multiple sequence alignment software , 2008, BMC Bioinformatics.

[11]  Michael Kaufmann,et al.  DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment , 2008, Algorithms for Molecular Biology.

[12]  Masato Ishikawa,et al.  Comprehensive study on iterative algorithms of multiple sequence alignment , 1995, Comput. Appl. Biosci..

[13]  K. Katoh,et al.  MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. , 2002, Nucleic acids research.

[14]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[15]  Christopher J. Lee,et al.  Multiple sequence alignment using partial order graphs , 2002, Bioinform..

[16]  Elisabeth R. M. Tillier,et al.  The accuracy of several multiple sequence alignment programs for proteins , 2006, BMC Bioinformatics.

[17]  Robert C. Edgar,et al.  Optimizing substitution matrix choice and gap parameters for sequence alignment , 2009, BMC Bioinformatics.

[18]  Kazutaka Katoh,et al.  Recent developments in the MAFFT multiple sequence alignment program , 2008, Briefings Bioinform..

[19]  R. Doolittle,et al.  Progressive sequence alignment as a prerequisitetto correct phylogenetic trees , 2007, Journal of Molecular Evolution.

[20]  Erik L L Sonnhammer,et al.  Quality assessment of multiple alignment programs , 2002, FEBS letters.

[21]  Olivier Poch,et al.  BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations , 2001, Nucleic Acids Res..

[22]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[23]  K. Katoh,et al.  MAFFT version 5: improvement in accuracy of multiple sequence alignment , 2005, Nucleic acids research.

[24]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[25]  Chuong B. Do,et al.  ProbCons: Probabilistic consistency-based multiple sequence alignment. , 2005, Genome research.

[26]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[27]  Desmond G. Higgins,et al.  Sequence embedding for fast construction of guide trees for multiple sequence alignment , 2010, Algorithms for Molecular Biology.

[28]  Eugene W. Myers,et al.  Optimal alignments in linear space , 1988, Comput. Appl. Biosci..