Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega

Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with datasets of many thousands of sequences. Some methods allow computation of larger datasets while sacrificing quality, and others produce high-quality alignments but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger datasets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
