Divide-and-Conquer Tree Estimation: Opportunities and Challenges

Large-scale phylogeny estimation is challenging for many reasons, including heterogeneity across the Tree of Life and the difficulty of finding good solutions to NP-hard optimization problems. One promising way to enable large-scale phylogeny estimation is divide-and-conquer: a dataset is divided into overlapping subsets, trees are estimated on the subsets, and the subset trees are then merged into a tree on the full set of taxa. This last step uses a supertree method; such methods are popular in systematics for combining species trees from the scientific literature. Because most supertree methods are heuristics for NP-hard optimization problems, supertree estimation on large datasets is challenging in terms of both scalability and accuracy. In this chapter, we describe the current state of the art in supertree construction and the use of supertree methods in divide-and-conquer strategies, and we identify directions where future research could lead to improved supertree methods. Finally, we present a new type of divide-and-conquer strategy in which the dataset is divided into disjoint subsets, which bypasses the need for supertree estimation. Overall, this chapter aims to present research directions that could lead to new methods for scaling phylogeny estimation to large datasets.
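As a rough illustration of the first step only (dividing a dataset into overlapping subsets), the following Python sketch slides a fixed-size window over an ordered list of taxa. The function name `overlapping_subsets`, the parameters `size` and `overlap`, and the assumption that the taxa already arrive in a meaningful order are all hypothetical choices made for this example; it is not the decomposition used by any particular method discussed in the chapter, and it does not attempt the tree estimation or merging steps.

```python
from typing import List, Sequence


def overlapping_subsets(taxa: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Split an ordered sequence of taxa into windows of at most `size` taxa,
    where each window starts `size - overlap` positions after the previous one.

    Hypothetical helper for illustration; assumes `taxa` is already ordered
    so that nearby taxa belong together.
    """
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    subsets: List[List[str]] = []
    start = 0
    while True:
        subsets.append(list(taxa[start:start + size]))
        # Stop once the current window reaches the end of the taxon list.
        if start + size >= len(taxa):
            break
        start += step
    return subsets


# Toy dataset of ten taxon labels (made-up names).
taxa = [f"t{i}" for i in range(10)]
subsets = overlapping_subsets(taxa, size=4, overlap=2)
for s in subsets:
    print(s)
```

In this toy setting every taxon appears in at least one subset and consecutive subsets share taxa, which is the property a supertree method would rely on when merging the subset trees. Setting `overlap=0` gives disjoint subsets, the situation in which the merging step can no longer be done by a supertree method.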
