Identifiability of a Markovian model of molecular evolution with gamma-distributed rates

Inference of evolutionary trees and rates from biological sequences is commonly performed using continuous-time Markov models of character change. The Markov process evolves along an unknown tree while observations arise only from the tips of the tree. Rate heterogeneity is present in most real data sets and is accounted for by the use of flexible mixture models where each site is allowed its own rate. Very little has been rigorously established concerning the identifiability of the models currently in common use in data analysis, although nonidentifiability was proven for a semiparametric model and an incorrect proof of identifiability was published for a general parametric model (GTR + Γ + I). Here we prove that one of the most widely used models (GTR + Γ) is identifiable for generic parameters, and for all parameter choices in the case of four-state (DNA) models. This is the first proof of identifiability of a phylogenetic model with a continuous distribution of rates.
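To make the role of gamma-distributed rates concrete, the following is a minimal sketch and not the paper's construction. It assumes the Jukes-Cantor model (a one-parameter special case of GTR), a gamma rate distribution with shape `alpha` and mean 1, and a single branch of length `t`. All function names are hypothetical. Under these assumptions, averaging the fixed-rate probability over the rate distribution has a closed form that follows from the gamma moment generating function, and the sketch checks that form against direct numerical integration.

```python
import math


def jc_same_prob(t, rate=1.0):
    """P(same state after time t) under Jukes-Cantor with one fixed site rate."""
    return 0.25 + 0.75 * math.exp(-4.0 * rate * t / 3.0)


def jc_gamma_same_prob(t, alpha):
    """Same probability averaged over Gamma(shape=alpha, mean=1) site rates.

    Uses E[exp(-s * r)] = (1 + s / alpha) ** (-alpha) with s = 4t/3.
    """
    return 0.25 + 0.75 * (1.0 + 4.0 * t / (3.0 * alpha)) ** (-alpha)


def jc_gamma_same_prob_numeric(t, alpha, n=100000, r_max=60.0):
    """Midpoint-rule integration of the fixed-rate probability against the
    gamma density, as an independent check of the closed form.

    Assumes alpha >= 1 so the density is bounded near r = 0.
    """
    h = r_max / n
    norm = alpha ** alpha / math.gamma(alpha)
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * h
        density = norm * r ** (alpha - 1.0) * math.exp(-alpha * r)
        total += jc_same_prob(t, r) * density * h
    return total


if __name__ == "__main__":
    t, alpha = 0.5, 2.0
    print(jc_gamma_same_prob(t, alpha))
    print(jc_gamma_same_prob_numeric(t, alpha))
```

The closed form shows why identifiability is not obvious: the observed probabilities depend on `t` and `alpha` only through the combined expression, so different parameter pairs can look similar on a single branch, and it is the joint pattern distribution over several tips that must separate them.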
