TreeMerge: a new method for improving the scalability of species tree estimation methods

Abstract Motivation At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n5) running time for datasets with n species. Results Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework—only O(n2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All. Availability and implementation TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge). Supplementary information Supplementary data are available at Bioinformatics online.

[1]  David Posada,et al.  SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees , 2015, bioRxiv.

[2]  J. Kruskal On the shortest spanning subtree of a graph and the traveling salesman problem , 1956 .

[3]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[4]  Tandy Warnow,et al.  Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation , 2017 .

[5]  Tandy J. Warnow,et al.  Long‐Branch Attraction in Species Tree Estimation: Inconsistency of Partitioned Likelihood and Topology‐Based Summary Methods , 2018, Systematic biology.

[6]  M. Steel The complexity of reconstructing trees from qualitative characters and subtrees , 1992 .

[7]  Md. Shamsuzzoha Bayzid,et al.  Whole-genome analyses resolve early branches in the tree of life of modern birds , 2014, Science.

[8]  Tandy Warnow,et al.  ASTRID: Accurate Species TRees from Internode Distances , 2015, bioRxiv.

[9]  Ziheng Yang,et al.  INDELible: A Flexible Simulator of Biological Sequence Evolution , 2009, Molecular biology and evolution.

[10]  Sébastien Roch,et al.  A short proof that phylogenetic tree reconstruction by maximum likelihood is hard , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[11]  Tandy J. Warnow,et al.  ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes , 2015, Bioinform..

[12]  Tandy Warnow,et al.  Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting. , 2016, Systematic biology.

[13]  Paramvir S. Dehal,et al.  FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments , 2010, PloS one.

[14]  Tandy Warnow,et al.  SuperFine: fast and accurate supertree estimation. , 2012, Systematic biology.

[15]  M. Gouy,et al.  Genome-scale coestimation of species and gene trees , 2013, Genome research.

[16]  Olivier Gascuel,et al.  FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program , 2015, Molecular biology and evolution.

[17]  Serita M. Nelesen,et al.  Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees , 2009, Science.

[18]  Chao Zhang,et al.  ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees , 2018, BMC Bioinformatics.

[19]  Tandy Warnow,et al.  To include or not to include: The impact of gene filtering on species tree estimation methods , 2017, bioRxiv.

[20]  Jeet Sukumaran,et al.  DendroPy: a Python library for phylogenetic computing , 2010, Bioinform..

[21]  Tandy J. Warnow,et al.  PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences , 2015, J. Comput. Biol..

[22]  M. Steel Recovering a tree from the leaf colourations it generates under a Markov model , 1994 .

[23]  L. Kubatko,et al.  Inconsistency of phylogenetic estimates from concatenated data under coalescence. , 2007, Systematic biology.

[24]  Michael T. Hallett,et al.  Simultaneous Identification of Duplications and Lateral Gene Transfers , 2011, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[25]  W. Maddison Gene Trees in Species Trees , 1997 .

[26]  S. Tavaré Some probabilistic and statistical problems in the analysis of DNA sequences , 1986 .

[27]  Tandy J. Warnow,et al.  Gene tree parsimony for incomplete gene trees: addressing true biological loss , 2017, Algorithms for Molecular Biology.

[28]  Maria Jesus Martin,et al.  Big data and other challenges in the quest for orthologs , 2014, Bioinform..

[29]  J. Townsend,et al.  A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing , 2015, Nature.

[30]  Alexandros Stamatakis,et al.  RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies , 2014, Bioinform..

[31]  L. Nakhleh,et al.  Computational approaches to species phylogeny inference and gene tree reconciliation. , 2013, Trends in ecology & evolution.

[32]  Dannie Durand,et al.  Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees , 2012, Bioinform..

[33]  Liang Liu,et al.  Estimating species trees from unrooted gene trees. , 2011, Systematic biology.

[34]  Saravanaraj N. Ayyampalayam,et al.  Phylotranscriptomic analysis of the origin and early diversification of land plants , 2014, Proceedings of the National Academy of Sciences.

[35]  Tandy J. Warnow,et al.  DACTAL: divide-and-conquer trees (almost) without alignments , 2012, Bioinform..

[36]  Mark H. Ellisman,et al.  Alterations in mGluR5 Expression and Signaling in Lewy Body Disease and in Transgenic Models of Alpha-Synucleinopathy – Implications for Excitotoxicity , 2010, PloS one.

[37]  B. Faircloth,et al.  Analysis of a Rapid Evolutionary Radiation Using Ultraconserved Elements: Evidence for a Bias in Some Multispecies Coalescent Methods. , 2016, Systematic biology.

[38]  P. Waddell,et al.  Rapid Evaluation of Least-Squares and Minimum-Evolution Criteria on Phylogenetic Trees , 1998 .

[39]  Tandy J. Warnow,et al.  ASTRAL: genome-scale coalescent-based species tree estimation , 2014, Bioinform..

[40]  Md. Shamsuzzoha Bayzid,et al.  Statistical binning enables an accurate coalescent-based estimation of the avian tree , 2014, Science.

[41]  D. Robinson,et al.  Comparison of phylogenetic trees , 1981 .

[42]  James H. Degnan,et al.  Species Tree Inference from Gene Splits by Unrooted STAR Methods , 2016, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[43]  Oliver Eulenstein,et al.  Algorithms for Genome-Scale Phylogenetics Using Gene Tree Parsimony , 2013, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[44]  Serita M. Nelesen,et al.  SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. , 2012, Systematic biology.