OCTAL: Optimal Completion of gene trees in polynomial time

Background: For a combination of reasons (including data generation protocols, approaches to taxon and gene sampling, and gene birth and loss), estimated gene trees are often incomplete, meaning that they do not contain all of the species of interest. As incomplete gene trees can impact downstream analyses, accurate completion of gene trees is desirable.

Results: We introduce the Optimal Tree Completion problem, a general optimization problem that involves completing an unrooted binary tree (i.e., adding missing leaves) so as to minimize its distance from a reference tree on a superset of the leaves. We present OCTAL, an algorithm that finds an optimal solution to this problem when the distance between trees is defined using the Robinson–Foulds (RF) distance, and we prove that OCTAL runs in $O(n^2)$ time, where $n$ is the total number of species. We report on a simulation study in which gene trees can differ from the species tree due to incomplete lineage sorting, and estimated gene trees are completed using OCTAL with a reference tree based on a species tree estimated from the multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than an existing heuristic approach in ASTRAL-II, but the accuracy of a completed gene tree computed by OCTAL depends on how topologically similar the reference tree (typically an estimated species tree) is to the true gene tree.

Conclusions: OCTAL is a useful technique for adding missing taxa to incomplete gene trees and provides good accuracy under a wide range of model conditions. However, results show that OCTAL's accuracy can be reduced when incomplete lineage sorting is high, as the reference tree can be far from the true gene tree. Hence, this study suggests that OCTAL would benefit from using other types of reference trees instead of species trees when there are large topological distances between true gene trees and species trees.
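To make the objective concrete, the following is a minimal sketch of the unnormalized RF distance (the number of non-trivial bipartitions present in exactly one of two trees on the same leaf set). It is not the OCTAL algorithm and is not taken from the paper's code; the nested-tuple tree encoding, the function names `leaf_set`, `bipartitions`, and `rf_distance`, and the example trees are all assumptions made for illustration.

```python
def leaf_set(tree):
    """Return the set of leaf labels in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so that the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        cluster = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - cluster if anchor in cluster else cluster
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return cluster

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Unnormalized RF distance: bipartitions found in exactly one tree."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("RF distance requires identical leaf sets")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical reference tree on five species.
reference = ((("a", "b"), "c"), ("d", "e"))

# Two hand-picked completions of an incomplete tree ((a, b), (d, e))
# that is missing species "c".
candidate_1 = ((("a", "b"), "c"), ("d", "e"))
candidate_2 = (("a", "b"), (("d", "c"), "e"))

dist_1 = rf_distance(reference, candidate_1)
dist_2 = rf_distance(reference, candidate_2)
```

Under this sketch, `candidate_1` would be preferred because it has the smaller RF distance to the reference tree. An optimal completion minimizes this quantity over all ways of adding the missing leaves, which OCTAL is stated to achieve in $O(n^2)$ time without enumerating candidates.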
