ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees

Background: Evolutionary histories can be discordant across the genome, and such discordances need to be considered in reconstructing the species phylogeny. ASTRAL is one of the leading methods for inferring species trees from gene trees while accounting for gene tree discordance. ASTRAL uses dynamic programming to search for the tree that shares the maximum number of quartet topologies with the input gene trees, restricting itself to a predefined set of bipartitions.

Results: We introduce ASTRAL-III, which substantially improves the running time of ASTRAL-II and guarantees polynomial running time as a function of both the number of species (n) and the number of genes (k). ASTRAL-III limits the bipartition constraint set (X) to grow at most linearly with n and k. Moreover, it handles polytomies more efficiently than ASTRAL-II, exploits similarities between gene trees better, and uses several techniques to avoid searching parts of the search space that are mathematically guaranteed not to include the optimal tree. The asymptotic running time of ASTRAL-III in the presence of polytomies is $O\left((nk)^{1.726} D\right)$, where $D = O(nk)$ is the sum of degrees of all unique nodes in the input trees. The running time improvements enable us to test whether contracting low-support branches in gene trees improves accuracy by reducing noise. In extensive simulations, we show that removing branches with very low support (e.g., below 10%) improves accuracy, while overly aggressive filtering is harmful. On a biological avian phylogenomic dataset of 14K genes, we observe that contracting low-support branches greatly improves results.

Conclusions: ASTRAL-III is a faster version of the ASTRAL method for phylogenetic reconstruction and can scale up to 10,000 species. With ASTRAL-III, low-support branches can be removed, resulting in improved accuracy.
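To make the branch-contraction step concrete, the following is a minimal Python sketch of collapsing low-support branches in a single gene tree, which turns the affected nodes into polytomies. It is an illustration only, not ASTRAL-III's implementation: the `Node` class and the `contract_low_support` and `to_newick` helpers are hypothetical names introduced here, and support values are assumed to be percentages stored on internal nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; not part of any ASTRAL code base."""
    name: Optional[str] = None       # leaf label; None for internal nodes
    support: Optional[float] = None  # support (percent) of the branch above this node
    children: List["Node"] = field(default_factory=list)


def contract_low_support(node: Node, threshold: float) -> Node:
    """Return a copy of the tree with internal branches below `threshold` collapsed."""
    new_children: List[Node] = []
    for child in node.children:
        contracted = contract_low_support(child, threshold)
        is_internal = bool(contracted.children)
        if is_internal and contracted.support is not None and contracted.support < threshold:
            # Collapse the branch: attach the grandchildren directly to this node,
            # which creates (or enlarges) a polytomy.
            new_children.extend(contracted.children)
        else:
            new_children.append(contracted)
    return Node(node.name, node.support, new_children)


def to_newick(node: Node) -> str:
    """Serialize the topology only (no supports or branch lengths)."""
    if not node.children:
        return node.name or ""
    return "(" + ",".join(to_newick(c) for c in node.children) + ")"
```

For example, with a 10% threshold, a tree whose clade (A,B) has support 5 and whose clade (C,D) has support 90 keeps (C,D) but dissolves (A,B) into the parent node. The input tree is left unmodified, so several thresholds can be compared on the same gene trees.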
