Species Tree Inference by Minimizing Deep Coalescences

In a 1997 seminal paper, W. Maddison proposed minimizing deep coalescences, or MDC, as an optimization criterion for inferring the species tree from a set of incongruent gene trees, assuming the incongruence is exclusively due to lineage sorting. In a subsequent paper, Maddison and Knowles provided and implemented a search heuristic for optimizing the MDC criterion, given a set of gene trees. However, the heuristic is not guaranteed to compute optimal solutions, and its hill-climbing search makes it slow in practice. In this paper, we provide two exact solutions to the problem of inferring the species tree from a set of gene trees under the MDC criterion. In other words, our solutions are guaranteed to find the tree that minimizes the total number of deep coalescences from a set of gene trees. One solution is based on a novel integer linear programming (ILP) formulation, and another is based on a simple dynamic programming (DP) approach. Powerful ILP solvers, such as CPLEX, make the first solution appealing, particularly for very large-scale instances of the problem, whereas the DP-based solution eliminates dependence on proprietary tools, and its simplicity makes it easy to integrate with other genomic events that may cause gene tree incongruence. Using the exact solutions, we analyze a data set of 106 loci from eight yeast species, a data set of 268 loci from eight Apicomplexan species, and several simulated data sets. We show that the MDC criterion provides very accurate estimates of the species tree topologies, and that our solutions are very fast, thus allowing for the accurate analysis of genome-scale data sets. Further, the efficiency of the solutions allow for quick exploration of sub-optimal solutions, which is important for a parsimony-based criterion such as MDC, as we show. We show that searching for the species tree in the compatibility graph of the clusters induced by the gene trees may be sufficient in practice, a finding that helps ameliorate the computational requirements of optimization solutions. Further, we study the statistical consistency and convergence rate of the MDC criterion, as well as its optimality in inferring the species tree. Finally, we show how our solutions can be used to identify potential horizontal gene transfer events that may have caused some of the incongruence in the data, thus augmenting Maddison's original framework. We have implemented our solutions in the PhyloNet software package, which is freely available at: http://bioinfo.cs.rice.edu/phylonet.

[1]  N. D. Levine,et al.  Taxonomy and review of the coccidian genus Cryptosporidium (protozoa, apicomplexa). , 1984, The Journal of protozoology.

[2]  D. Robinson,et al.  Comparison of phylogenetic trees , 1981 .

[3]  D. Maddison,et al.  Mesquite: a modular system for evolutionary analysis. Version 2.6 , 2009 .

[4]  J. Kissinger,et al.  The Apicomplexan Whole-Genome Phylogeny: An Analysis of Incongruence among Gene Trees , 2008, Molecular biology and evolution.

[5]  S. Carroll,et al.  Genome-scale approaches to resolving incongruence in molecular phylogenies , 2003, Nature.

[6]  Luay Nakhleh,et al.  RIATA-HGT: A Fast and Accurate Heuristic for Reconstructing Horizontal Gene Transfer , 2005, COCOON.

[7]  Bin Ma,et al.  From Gene Trees to Species Trees , 2000, SIAM J. Comput..

[8]  J. Jeffry Howbert,et al.  The Maximum Clique Problem , 2007 .

[9]  Alan M. Moses,et al.  Widespread Discordance of Gene Trees with Species Tree in Drosophila: Evidence for Incomplete Lineage Sorting , 2006, PLoS genetics.

[10]  W. P. Maddison,et al.  Mesquite: a modular system for evolutionary analysis. Version 2.01 (Build j28) , 2007 .

[11]  Luay Nakhleh,et al.  Confounding Factors in HGT Detection: Statistical Error, Coalescent Effects, and Multiple Solutions , 2007, J. Comput. Biol..

[12]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[13]  W. Maddison,et al.  Inferring phylogeny despite incomplete lineage sorting. , 2006, Systematic biology.

[14]  Luay Nakhleh,et al.  PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships , 2008, BMC Bioinformatics.

[15]  Laura Salter Kubatko,et al.  Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. , 2009, Theoretical population biology.

[16]  L. Kubatko,et al.  Inconsistency of phylogenetic estimates from concatenated data under coalescence. , 2007, Systematic biology.

[17]  D. Pearl,et al.  Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. , 2007, Systematic biology.

[18]  Luay Nakhleh,et al.  Efficient inference of bacterial strain trees from genome-scale multilocus data , 2008, ISMB.

[19]  D. Pearl,et al.  High-resolution species trees without concatenation , 2007, Proceedings of the National Academy of Sciences.