Statistical binning enables an accurate coalescent-based estimation of the avian tree

Introduction. Reconstructing species trees for rapid radiations, as in the early diversification of birds, is complicated by biological processes such as incomplete lineage sorting (ILS) that can cause different parts of the genome to have different evolutionary histories. Statistical methods that are based on the multispecies coalescent model and that combine gene trees can be highly accurate even in the presence of massive ILS; however, these methods can produce species trees that are topologically far from the true species tree when the estimated gene trees have error. We have developed a statistical binning technique to address gene tree estimation error and have explored its use in genome-scale species tree estimation with MP-EST, a popular coalescent-based species tree estimation method.

Figure. The statistical binning pipeline for estimating species trees from gene trees. Loci are grouped into bins based on a statistical test for combinability, before estimating gene trees.

Rationale. In statistical binning, phylogenetic trees on different genes are estimated and then placed into bins, so that the differences between trees in the same bin can be explained by estimation error (see the figure). A new tree is then estimated for each bin by applying maximum likelihood to a concatenated alignment of the multiple sequence alignments of its genes, and a species tree is estimated from these supergene trees using a coalescent-based species tree method.

Results. Under realistic conditions in our simulation study, statistical binning reduced the topological error of species trees estimated using MP-EST and enabled a coalescent-based analysis that was more accurate than concatenation even when gene tree estimation error was relatively high. Statistical binning also reduced the error in gene tree topology and species tree branch length estimation, especially when the phylogenetic signal in gene sequence alignments was low. Species trees estimated using MP-EST with statistical binning on four biological data sets showed increased concordance with the biological literature. When MP-EST was used to analyze 14,446 gene trees in the avian phylogenomics project, it produced a species tree that was discordant with the concatenation analysis and conflicted with prior literature. However, the statistical binning analysis produced a tree that was highly congruent with the concatenation analysis and was consistent with the prior scientific literature.

Conclusions. Statistical binning reduces the error in species tree topology and branch length estimation because it reduces gene tree estimation error. These improvements are greatest when gene trees have reduced bootstrap support, which was the case for the avian phylogenomics project. Because using unbinned gene trees can result in overestimation of ILS, statistical binning may be helpful in providing more accurate estimations of ILS levels in biological data sets. Thus, statistical binning enables highly accurate species tree estimations, even on genome-scale data sets.

Gene tree incongruence arising from incomplete lineage sorting (ILS) can reduce the accuracy of concatenation-based estimations of species trees. Although coalescent-based species tree estimation methods can have good accuracy in the presence of ILS, they are sensitive to gene tree estimation error. We propose a pipeline that uses bootstrapping to evaluate whether two genes are likely to have the same tree, groups genes into sets using a graph-theoretic optimization, estimates a tree on each set using concatenation, and finally produces an estimated species tree from these trees using the preferred coalescent-based method. Statistical binning improves the accuracy of MP-EST, a popular coalescent-based method, and we use it to produce the first genome-scale coalescent-based avian tree of life.
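The binning step described above can be illustrated with a small sketch. The abstract does not spell out the statistical test or the graph optimization, so the following rests on stated assumptions: each gene is summarized by the bipartitions of its tree that pass some bootstrap support threshold, two genes are treated as not combinable when any of their supported bipartitions conflict, and bins are formed by a simple saturation-based greedy vertex coloring of the resulting incompatibility graph. The function names, the toy taxa, and the example splits are all hypothetical and are not taken from the authors' software.

```python
from itertools import combinations


def compatible(a, b, taxa):
    """Two bipartitions (each given by one side) are compatible
    when at least one of the four side-by-side intersections is empty."""
    a_c, b_c = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_c) for y in (b, b_c))


def genes_conflict(splits_i, splits_j, taxa):
    """Assumed test: genes conflict if any pair of well-supported splits is incompatible."""
    return any(not compatible(a, b, taxa) for a in splits_i for b in splits_j)


def build_incompatibility_graph(supported_splits, taxa):
    """Vertices are genes; an edge joins two genes that should not share a bin."""
    adj = {i: set() for i in range(len(supported_splits))}
    for i, j in combinations(range(len(supported_splits)), 2):
        if genes_conflict(supported_splits[i], supported_splits[j], taxa):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def color_into_bins(adj):
    """Greedy coloring: repeatedly pick the uncolored vertex whose neighbors
    already use the most distinct colors (ties: higher degree, then lower index)
    and give it the smallest color not used by its neighbors."""
    color = {}
    while len(color) < len(adj):
        def priority(v):
            saturation = len({color[u] for u in adj[v] if u in color})
            return (saturation, len(adj[v]), -v)

        v = max((v for v in adj if v not in color), key=priority)
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    bins = {}
    for v, c in color.items():
        bins.setdefault(c, []).append(v)
    return [sorted(members) for _, members in sorted(bins.items())]


# Toy input (hypothetical): five taxa, four genes, splits that passed the support threshold.
taxa = frozenset("ABCDE")
supported_splits = [
    [frozenset("AB")],                   # gene 0
    [frozenset("AB"), frozenset("DE")],  # gene 1
    [frozenset("AC")],                   # gene 2 conflicts with genes 0 and 1
    [],                                  # gene 3 has no well-supported split
]
adj = build_incompatibility_graph(supported_splits, taxa)
bins = color_into_bins(adj)
print(bins)
```

In a full pipeline, the alignments of the genes in each bin would then be concatenated, a maximum likelihood tree would be estimated per bin, and those supergene trees would be passed to a coalescent-based method such as MP-EST. Those steps depend on external tools and are left out of the sketch.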
