Accurate large-scale phylogeny-aware alignment using BAli-Phy

MOTIVATION: BAli-Phy, a popular Bayesian method that co-estimates multiple sequence alignments and phylogenetic trees, is statistically rigorous, but its computational requirements have generally limited it to relatively small datasets (at most about 100 sequences). Here we repurpose BAli-Phy as a "phylogeny-aware" alignment method: we estimate the phylogeny from the unaligned input sequences and then use it as a fixed tree within BAli-Phy.

RESULTS: We show that this approach achieves high accuracy: it is greatly superior to Prank, currently the most popular phylogeny-aware alignment method, and is more accurate even than MAFFT, one of the top-performing alignment methods in common use. Furthermore, this approach can be used to align very large datasets (up to 1000 sequences in this study).

AVAILABILITY: The datasets used in this study are available at https://doi.org/10.13012/B2IDB-7863273_V1.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
