Optimal Completion of Incomplete Gene Trees in Polynomial Time Using OCTAL

Here we introduce the Optimal Tree Completion Problem, a general optimization problem in which an unrooted binary tree is completed (i.e., its missing leaves are added) so as to minimize its distance to a reference tree on a superset of the leaves. More formally, given a pair of unrooted binary trees (T, t), where T has leaf set S and t has leaf set R ⊆ S, we wish to add all the leaves in S \ R to t so as to produce a new tree t' on leaf set S that has minimum distance to T. We show that when the distance is the Robinson-Foulds (RF) distance, an optimal solution can be found in polynomial time. We also present OCTAL, an algorithm that solves this RF Optimal Tree Completion Problem exactly in quadratic time. We report on a simulation study in which we complete estimated gene trees using a reference tree based on a species tree estimated from a multi-locus dataset. OCTAL produces completed gene trees that are closer to the true gene trees than those produced by an existing heuristic approach, but the accuracy of the completed gene trees depends on how topologically similar the estimated species tree is to the true gene tree. Hence, under conditions with relatively low gene tree heterogeneity, OCTAL can provide highly accurate completions of estimated gene trees. We close with a discussion of future research.
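To make the problem statement concrete, the following is a minimal sketch in Python, not the OCTAL algorithm itself. It assumes trees are written as nested tuples of leaf names (an arbitrary rooting of the unrooted tree), computes the RF distance as the number of non-trivial bipartitions found in exactly one of the two trees, and completes t by brute force for the simplest case of a single missing leaf. The names `bipartitions`, `rf_distance`, `insertions`, and `complete_one_leaf` are hypothetical and chosen only for illustration.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of the tree, viewed as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if ref not in below else all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(t1, t2):
    """RF distance: bipartitions present in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)  # attach on the edge above this node
    if not isinstance(tree, str):
        for i, child in enumerate(tree):
            for new_child in insertions(child, leaf):
                yield tree[:i] + (new_child,) + tree[i + 1:]


def complete_one_leaf(T, t, missing):
    """Brute-force baseline (assumption: exactly one leaf is missing)."""
    return min(insertions(t, missing), key=lambda cand: rf_distance(T, cand))


# Reference tree T on S = {a, ..., f}; t lacks the leaf "c".
T = ((("a", "b"), "c"), ("d", ("e", "f")))
t = (("a", "b"), ("d", ("e", "f")))
t_completed = complete_one_leaf(T, t, "c")
```

Here `rf_distance(T, t_completed)` is 0, because t agrees with T on the shared leaves and "c" can be placed exactly where T has it. Trying every edge for every missing leaf scales poorly as more leaves are missing; the abstract's claim is that an optimal completion can instead be found exactly in quadratic time.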
