Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites.

Inferring the evolutionary history of a collection of organisms is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to significant challenges in the inference of these phylogenetic relationships. Among these challenges is inferring the evolutionary history of a collection of species from sequence information for several distinct genes sampled throughout the genome. It is widely accepted that each individual gene has its own phylogeny, which may not agree with the species tree. Many possible causes of this gene tree incongruence are known. The best studied is incomplete lineage sorting, which is commonly modeled by the coalescent process. Numerous methods based on the coalescent process have been proposed for estimating the phylogenetic species tree from DNA sequence data. However, these methods assume that the species tree can be identified from DNA sequence data at the leaves of the tree, although this has not been formally established. We prove that the unrooted topology of the n-leaf phylogenetic species tree is generically identifiable from observed data at the leaves of the tree, where the data are assumed to have arisen from the coalescent process under a time-reversible substitution process, with the possibility of site-specific rate variation modeled by the discrete gamma distribution and a proportion of invariable sites.
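
To make the rate-variation part of the model concrete, the following is a minimal sketch of how equally weighted rate categories and a proportion of invariable sites combine into a mixture probability. It is not the paper's construction: it assumes the Jukes-Cantor model as the simplest time-reversible process, considers only two sequences separated by a path of length `t`, and omits the coalescent entirely. The function names and the rate values are hypothetical; in particular, the rates are illustrative placeholders and are not computed from a gamma distribution.

```python
import math


def jc69_same_prob(t):
    """Probability that a site has the same state at both ends of a path
    of length t (expected substitutions per site) under Jukes-Cantor."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)


def mixture_same_prob(t, rates, p_inv):
    """Same-state probability under a rates-across-sites mixture.

    rates : relative rates of equally weighted categories (assumed inputs,
            e.g. category means of a discretized gamma distribution)
    p_inv : proportion of invariable sites, 0 <= p_inv < 1
    """
    if not rates:
        raise ValueError("at least one rate category is required")
    if not 0.0 <= p_inv < 1.0:
        raise ValueError("p_inv must lie in [0, 1)")
    # Rescale so the variable-site rates average to 1.
    mean_rate = sum(rates) / len(rates)
    variable = sum(jc69_same_prob(t * r / mean_rate) for r in rates) / len(rates)
    # Invariable sites never change, so they match with probability 1.
    return p_inv + (1.0 - p_inv) * variable


if __name__ == "__main__":
    # Hypothetical category rates, chosen only for illustration.
    example_rates = [0.2, 0.6, 1.1, 2.1]
    print(mixture_same_prob(0.5, example_rates, p_inv=0.1))
```

With a single rate category and `p_inv = 0`, the mixture reduces to the plain Jukes-Cantor probability, which is a convenient sanity check on the sketch.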
