Fast and consistent estimation of species trees using supermatrix rooted triples.

Concatenated sequence alignments are often used to infer species-level relationships. Previous studies have shown that analysis of concatenated data using maximum likelihood (ML) can produce misleading results when loci have differing gene tree topologies due to incomplete lineage sorting. Here, we develop a polynomial-time method that uses the modified mincut supertree algorithm to construct an estimated species tree from rooted triples inferred from concatenated alignments. We term this method SuperMatrix Rooted Triple (SMRT) and use the notation SMRT-ML when rooted triples are inferred by ML. We use simulations to investigate the performance of SMRT-ML under Jukes-Cantor and general time-reversible substitution models for four- and five-taxon species trees, and we also apply the method to an empirical data set of yeast genes. We find that SMRT-ML converges to the correct species tree in many cases in which ML on the full concatenated data set fails to do so. SMRT-ML can be conservative in that its output tree is often partially unresolved for problematic clades. We show analytically that when the species tree is clocklike and mutations occur under the Cavender-Farris-Neyman substitution model, as the number of genes increases, SMRT-ML is increasingly likely to infer the correct species tree even when the most likely gene tree does not match the species tree. SMRT-ML is therefore a computationally efficient and statistically consistent estimator of the species tree when gene trees are distributed according to the multispecies coalescent model.
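The two-stage structure of the method (infer one rooted triple for every set of three taxa, then assemble the triples into a single tree) can be illustrated with a minimal Python sketch. It rests on stated assumptions and is not the authors' implementation. In particular, `infer_triple` is a hypothetical placeholder for ML inference of a rooted triple from the concatenated alignment, and the assembly step uses the classic BUILD algorithm of Aho et al. (1981) rather than the modified mincut supertree algorithm. BUILD returns a tree only when all triples are mutually compatible, whereas mincut-style methods are designed to still return a tree when triples conflict.

```python
from itertools import combinations


def build(taxa, triples):
    """BUILD (Aho et al.): return a nested-tuple tree displaying every triple, or None.

    Each triple is ((a, b), c), read as "a and b are closer to each other than to c".
    """
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]
    # Keep only the triples whose three taxa all lie in the current taxon set.
    current = set(taxa)
    local = [((a, b), c) for (a, b), c in triples if {a, b, c} <= current]

    # Union-find: connect a and b for every triple ab|c.
    parent = {t: t for t in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), _ in local:
        parent[find(a)] = find(b)

    groups = {}
    for t in taxa:
        groups.setdefault(find(t), []).append(t)
    if len(groups) == 1:
        return None  # a single component means the triples are incompatible

    children = []
    for members in sorted(groups.values()):
        subtree = build(members, local)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


def to_newick(tree):
    """Render a nested-tuple tree as a Newick-style string (no branch lengths)."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def smrt_sketch(taxa, infer_triple):
    """Infer a triple for every 3-taxon subset, then assemble the triples with BUILD.

    infer_triple(a, b, c) is a hypothetical callable returning ((x, y), z).
    """
    triples = [infer_triple(*trio) for trio in combinations(sorted(taxa), 3)]
    return build(taxa, triples)


# Toy input: the four rooted triples displayed by the tree (((A,B),C),D).
example = [(("A", "B"), "C"), (("A", "B"), "D"),
           (("A", "C"), "D"), (("B", "C"), "D")]
print(to_newick(build("ABCD", example)))  # prints (((A,B),C),D)
```

The number of three-taxon subsets grows as n choose 3, so the triple-inference stage in this sketch makes a polynomial number of calls to `infer_triple`.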
