Using INC Within Divide-and-Conquer Phylogeny Estimation

A recent paper (Zhang, Rao, and Warnow, Algorithms for Molecular Biology 2019) presented the INC (incremental tree building) algorithm and proved it to be absolute fast converging under standard sequence evolution models. The same paper presented Constrained INC, a variant that takes a set of disjoint constraint trees as input and uses INC to merge them. We report on a study evaluating INC on a range of simulated datasets and show that it has very poor accuracy in comparison to standard methods. We also explore the design space for divide-and-conquer strategies for phylogeny estimation that use Constrained INC, and we show modifications that improve accuracy. In particular, we present INC-ML, a divide-and-conquer approach to maximum likelihood (ML) estimation that comes close to the leading ML heuristics in accuracy and is more accurate than the current best distance-based methods.
