Large-scale multiple sequence alignment and tree estimation using SATé.

SATé is a method for estimating multiple sequence alignments and trees that has been shown to produce highly accurate results for datasets with large numbers of sequences. Running SATé with its default settings is simple, but accuracy can be improved by modifying its algorithmic parameters. We give a detailed introduction to the algorithmic approach used by SATé and instructions for running a SATé analysis from the GUI under default settings. We also discuss how to modify these settings to obtain improved results and how to use SATé in a phylogenetic analysis pipeline.
