Benchmarking Statistical Multiple Sequence Alignment

The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations.

Keywords: multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology
