Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide only an imperfect estimate of the topology of that gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multi-species coalescent with a standard site substitution model, maximum likelihood estimation applied to sequence data concatenated across genes, under the incorrect assumption that all sites have evolved independently and identically on a fixed tree, is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).
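The first hurdle can be made concrete with a minimal sketch in Python, which is not taken from the paper. It assumes a rooted three-taxon species tree ((A,B),C) with one lineage sampled per species and an internal branch of length t in coalescent units. Under the multispecies coalescent, the gene tree then matches the species tree with probability 1 - (2/3)e^{-t}, a standard result. The function names and topology labels below are hypothetical, and the sketch illustrates gene tree discordance only, not the inconsistency result itself.

```python
import math
import random


def concordance_probability(t):
    """Exact probability that a three-taxon gene tree has topology AB|C
    when the species tree is ((A,B),C) with internal branch length t
    (coalescent units), under the multispecies coalescent."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def simulate_gene_tree(t, rng):
    """Draw one gene tree topology for species tree ((A,B),C)."""
    # In the internal branch, the A and B lineages coalesce after an
    # Exp(1) waiting time; if that happens before time t, the gene tree
    # agrees with the species tree.
    if rng.expovariate(1.0) < t:
        return "AB|C"
    # Otherwise all three lineages reach the root population, where each
    # of the three pairs is equally likely to coalesce first.
    return rng.choice(["AB|C", "AC|B", "BC|A"])


def estimate_concordance(t, n_genes, seed=0):
    """Monte Carlo estimate of the concordance probability."""
    rng = random.Random(seed)
    hits = sum(simulate_gene_tree(t, rng) == "AB|C" for _ in range(n_genes))
    return hits / n_genes


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        exact = concordance_probability(t)
        approx = estimate_concordance(t, 20000)
        print(f"t={t:.1f}  exact={exact:.4f}  simulated={approx:.4f}")
```

For short internal branches the concordance probability approaches 1/3, so most gene trees disagree with one another. This is the regime in which pooling all sites onto a single fixed tree is least justified.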

[1]  Manuel Dehnert,et al.  Probability Models for DNA Sequence Evolution (2nd edn.). R. Durrett (2008). New York: Springer. ISBN: 978-0-387-78168-6 , 2009 .

[2]  Noah A Rosenberg,et al.  The probability of topological concordance of gene trees and species trees. , 2002, Theoretical population biology.

[3]  Scott V Edwards,et al.  Coalescent methods for estimating phylogenetic trees. , 2009, Molecular phylogenetics and evolution.

[4]  Bin Ma,et al.  From Gene Trees to Species Trees , 2000, SIAM J. Comput..

[5]  Laura Salter Kubatko,et al.  STEM: species tree estimation using maximum likelihood for gene trees under coalescence , 2009, Bioinform..

[6]  L. Kubatko,et al.  Inconsistency of phylogenetic estimates from concatenated data under coalescence. , 2007, Systematic biology.

[7]  Laura Kubatko,et al.  Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. , 2014, Journal of theoretical biology.

[8]  Colin N. Dewey,et al.  BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis , 2010, Bioinform..

[9]  M Steel,et al.  Links between maximum likelihood and maximum parsimony under a simple model of site substitution. , 1997, Bulletin of mathematical biology.

[10]  S. Carroll,et al.  Genome-scale approaches to resolving incongruence in molecular phylogenies , 2003, Nature.

[11]  David Bryant,et al.  Properties of consensus methods for inferring species trees from gene trees. , 2008, Systematic biology.

[12]  T. J. Robinson,et al.  Impacts of the Cretaceous Terrestrial Revolution and KPg Extinction on Mammal Diversification , 2011, Science.

[13]  R. Durrett Probability Models for DNA Sequence Evolution , 2002 .

[14]  D. Pearl,et al.  Estimating species phylogenies using coalescence times among sequences. , 2009, Systematic biology.

[15]  F. Tajima Evolutionary relationship of DNA sequences in finite populations. , 1983, Genetics.

[16]  Sudhindra R Gadagkar,et al.  Inferring species phylogenies from multiple genes: concatenated sequence tree versus consensus gene tree. , 2005, Journal of experimental zoology. Part B, Molecular and developmental evolution.

[17]  Mike Steel Consistency of Bayesian inference of resolved phylogenetic trees. , 2013, Journal of theoretical biology.

[18]  Bruce Rannala,et al.  The accuracy of species tree estimation under simulation: a comparison of methods. , 2011, Systematic biology.

[19]  Elchanan Mossel,et al.  Incomplete Lineage Sorting: Consistent Phylogeny Estimation from Multiple Loci , 2007, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[20]  Tandy Warnow,et al.  Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting. , 2016, Systematic biology.

[21]  Tandy J. Warnow,et al.  Naive binning improves phylogenomic analyses , 2013, Bioinform..

[22]  A. Drummond,et al.  Bayesian Inference of Species Trees from Multilocus Data , 2009, Molecular biology and evolution.

[23]  Qixin He,et al.  Sources of error inherent in species-tree estimation: impact of mutational and coalescent effects on accuracy and implications for choosing among different methods. , 2010, Systematic biology.

[24]  J. Degnan,et al.  Fast and consistent estimation of species trees using supermatrix rooted triples. , 2010, Molecular biology and evolution.

[25]  Sen Song,et al.  Reply to Gatesy and Springer: The multispecies coalescent model can effectively handle recombination and gene tree heterogeneity , 2013, Proceedings of the National Academy of Sciences of the United States of America.

[26]  R. Nichols,et al.  Gene trees and species trees are not the same. , 2001, Trends in ecology & evolution.

[27]  M. Springer,et al.  Concatenation versus coalescence versus “concatalescence” , 2013, Proceedings of the National Academy of Sciences.

[28]  Laura Kubatko,et al.  Estimating species trees : practical and theoretical aspects , 2010 .

[29]  Sébastien Roch,et al.  An Analytical Comparison of Multilocus Methods Under the Multispecies Coalescent: The Three-Taxon Case , 2012, Pacific Symposium on Biocomputing.

[30]  Noah A Rosenberg,et al.  Gene tree discordance, phylogenetic inference and the multispecies coalescent. , 2009, Trends in ecology & evolution.

[31]  Sen Song,et al.  Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model , 2012, Proceedings of the National Academy of Sciences.