New Divide-and-Conquer Techniques for Large-Scale Phylogenetic Estimation

In recent years, the availability of genomic sequence data from thousands of species has raised hopes that a phylogenetic tree of all life might be achievable. Yet the most accurate methods for estimating phylogenies are heuristics for NP-hard optimization problems, and many of them are too computationally intensive to use on large datasets. Divide-and-conquer approaches have been proposed to address this scalability problem: they divide the species set into subsets, construct a tree on each subset, and then merge the subset trees into a tree on the full set. Prior approaches have divided the species set into overlapping subsets and used supertree methods to merge the subset trees, but limitations of supertree methods suggest that this kind of divide-and-conquer approach is unlikely to scale to ultra-large datasets. Recently, a new approach has been developed that divides the species set into disjoint subsets, computes a tree on each subset, and then combines the subset trees using auxiliary information (e.g., a distance matrix). Here, we describe these strategies and their theoretical properties, present open problems, and discuss opportunities for impact in large-scale phylogenetic estimation using these and similar approaches.
