Using Constrained-INC for Large-Scale Gene Tree and Species Tree Estimation

Incremental tree building (INC) is a new phylogeny estimation method that has been proven to be absolute fast converging under standard sequence evolution models. A variant of INC, called Constrained-INC, is designed for use in divide-and-conquer pipelines for phylogeny estimation, in which a set of species is divided into disjoint subsets, trees are computed on the subsets using a selected base method, and the subset trees are then combined. We evaluate the accuracy of INC and Constrained-INC for gene tree and species tree estimation on simulated datasets, and compare them to similar pipelines using NJMerge (another method that merges disjoint trees). For gene tree estimation, we find that INC has very poor accuracy in comparison to standard methods, and that even Constrained-INC (using maximum likelihood methods to compute constraint trees) does not match the accuracy of the better maximum likelihood methods. Results for species trees are somewhat different: Constrained-INC comes close to the accuracy of the best species tree estimation methods while being much faster, and it allows species tree estimation methods to scale to large datasets within limited computational resources. Overall, this study exposes both the benefits and the limitations of divide-and-conquer strategies for large-scale phylogenetic tree estimation.
