SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.

Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.

[1]  H. Munro,et al.  Mammalian protein metabolism , 1964 .

[2]  T. Jukes CHAPTER 24 – Evolution of Protein Molecules , 1969 .

[3]  D. Sankoff Minimal Mutation Trees of Sequences , 1975 .

[4]  S. Jeffery Evolution of Protein Molecules , 1979 .

[5]  J. Oliver,et al.  The general stochastic model of nucleotide substitution. , 1990, Journal of theoretical biology.

[6]  J. Hein Unified approach to alignment and phylogenies. , 1990, Methods in enzymology.

[7]  M. Litzkow REMOTE UNIX TURNING IDLE WORKSTATIONS INTO CYCLE SERVERS , 1992 .

[8]  Maryse Condé Tree of Life , 1992 .

[9]  J. Bull,et al.  An Empirical Test of Bootstrapping as a Method for Assessing Confidence in Phylogenetic Analysis , 1993 .

[10]  R. Gutell,et al.  Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective. , 1994, Microbiological reviews.

[11]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[12]  W. Wheeler,et al.  MALIGN: A Multiple Sequence Alignment Program , 1994 .

[13]  Ward C. Wheeler,et al.  SEQUENCE ALIGNMENT, PARAMETER SENSITIVITY, AND THE PHYLOGENETIC ANALYSIS OF MOLECULAR DATA , 1995 .

[14]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[15]  W. Wheeler OPTIMIZATION ALIGNMENT: THE END OF MULTIPLE SEQUENCE ALIGNMENT IN PHYLOGENETICS? , 1996 .

[16]  M Steel,et al.  Links between maximum likelihood and maximum parsimony under a simple model of site substitution. , 1997, Bulletin of mathematical biology.

[17]  John D. Kececioglu,et al.  Aligning Alignments , 1998, CPM.

[18]  Folker Meyer,et al.  Rose: generating sequence families , 1998, Bioinform..

[19]  R. Ravi,et al.  GESTALT: Genomic Steiner Alignments , 1999, CPM.

[20]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[21]  M. P. Cummings,et al.  PAUP* Phylogenetic analysis using parsimony (*and other methods) Version 4 , 2000 .

[22]  Shane S. Sturrock,et al.  Time Warps, String Edits, and Macromolecules – The Theory and Practice of Sequence Comparison . David Sankoff and Joseph Kruskal. ISBN 1-57586-217-4. Price £13.95 (US$22·95). , 2000 .

[23]  Nan Yu,et al.  The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs , 2002, BMC Bioinformatics.

[24]  Tandy J. Warnow,et al.  The Accuracy of Fast Phylogenetic Methods for Large Datasets , 2001, Pacific Symposium on Biocomputing.

[25]  R. Gutell,et al.  The accuracy of ribosomal RNA comparative structure models. , 2002, Current opinion in structural biology.

[26]  Nan Yu,et al.  The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs: Correction , 2002, BMC Bioinformatics.

[27]  Michael J. Sanderson,et al.  R8s: Inferring Absolute Rates of Molecular Evolution, Divergence times in the Absence of a Molecular Clock , 2003, Bioinform..

[28]  O. Gascuel,et al.  A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. , 2003, Systematic biology.

[29]  Michael P. Cummings,et al.  PAUP* [Phylogenetic Analysis Using Parsimony (and Other Methods)] , 2004 .

[30]  J. Felsenstein,et al.  Inching toward reality: An improved likelihood model of sequence evolution , 2004, Journal of Molecular Evolution.

[31]  Robert C. Edgar,et al.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity , 2004, BMC Bioinformatics.

[32]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[33]  K. Katoh,et al.  MAFFT version 5: improvement in accuracy of multiple sequence alignment , 2005, Nucleic acids research.

[34]  Arndt von Haeseler,et al.  Simultaneous statistical multiple alignment and phylogeny reconstruction. , 2005, Systematic biology.

[35]  István Miklós,et al.  Bayesian coestimation of phylogeny and sequence alignment , 2005, BMC Bioinformatics.

[36]  M. Suchard,et al.  Joint Bayesian estimation of alignment and phylogeny. , 2005, Systematic biology.

[37]  M. Kimura A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences , 1980, Journal of Molecular Evolution.

[38]  J. Felsenstein,et al.  An evolutionary model for maximum likelihood alignment of DNA sequences , 1991, Journal of Molecular Evolution.

[39]  Benjamin D. Redelings,et al.  BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny , 2006, Bioinform..

[40]  Alexandros Stamatakis,et al.  RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models , 2006, Bioinform..

[41]  Kazutaka Katoh,et al.  PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences , 2007, Bioinform..

[42]  John D. Kececioglu,et al.  Multiple alignment by aligning alignments , 2007, ISMB/ECCB.

[43]  M. Donoghue,et al.  Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches , 2009, BMC Evolutionary Biology.

[44]  Kazutaka Katoh,et al.  Recent developments in the MAFFT multiple sequence alignment program , 2008, Briefings Bioinform..

[45]  István Miklós,et al.  StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees , 2008, Bioinform..

[46]  Tandy J. Warnow,et al.  Barking Up The Wrong Treelength: The Impact of Gap Penalty on Alignment and Tree Accuracy , 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[47]  Serita M. Nelesen,et al.  Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees , 2009, Science.

[48]  W. Wheeler,et al.  POY version 4: phylogenetic analysis using dynamic homologies , 2010, Cladistics : the international journal of the Willi Hennig Society.

[49]  Paramvir S. Dehal,et al.  FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments , 2010, PloS one.

[50]  C. Randal Linder,et al.  Multiple sequence alignment: a major challenge to large-scale phylogenetics , 2011, PLoS currents.