Identifiability of Two-Tree Mixtures for Group-Based Models

Phylogenetic data arising on two possibly different tree topologies may be mixed through several biological mechanisms: incomplete lineage sorting or horizontal gene transfer when the topologies differ, or simply different substitution processes acting on different characters when the topology is the same. Recent work on a 2-state symmetric model of character change showed that, for 4 taxa, such a mixture model has nonidentifiable parameters, so it is theoretically impossible to determine the two tree topologies from any amount of data under those circumstances. Here, we investigate the question of identifiability for two-tree mixtures of the 4-state group-based models, which are more relevant to DNA sequence data. Using algebraic techniques, we show that the tree parameters are identifiable for the Jukes-Cantor (JC) and Kimura 2-parameter (K2P) models. We also prove that generic substitution parameters for the JC mixture models are identifiable, and for the K2P and Kimura 3-parameter (K3P) models we obtain generic identifiability results for mixtures on the same tree. This indicates that the full phylogenetic signal remains in such mixtures, and that the 2-state symmetric result is therefore a misleading guide to the behavior of other models.
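
To make the object of study concrete, the following is a minimal sketch of how a two-tree JC mixture on 4 taxa produces a site-pattern distribution. It is an illustration only and not the algebraic method used in the paper. The function names (`jc_matrix`, `quartet_distribution`, `mixture`), the choice to parameterize each edge by its total probability of change, and all numerical values are assumptions made for this example.

```python
from itertools import product

STATES = range(4)  # the four nucleotide states, e.g. A, C, G, T


def jc_matrix(p):
    """Jukes-Cantor transition matrix for one edge.

    p is the total probability that the state changes along the edge
    (an assumed parameterization; 0 <= p < 3/4).
    """
    return [[1.0 - p if i == j else p / 3.0 for j in STATES] for i in STATES]


def quartet_distribution(split, pendant, internal):
    """Site-pattern distribution on an unrooted 4-taxon tree.

    split    : ((a, b), (c, d)), the taxa joined at each internal node
    pendant  : dict mapping taxon index -> change probability on its edge
    internal : change probability on the internal edge
    """
    (a, b), (c, d) = split
    M = {t: jc_matrix(pendant[t]) for t in (a, b, c, d)}
    Me = jc_matrix(internal)
    dist = {}
    for pattern in product(STATES, repeat=4):
        total = 0.0
        # Sum over the unobserved states u, v at the two internal nodes,
        # with a uniform root distribution placed at node u.
        for u in STATES:
            for v in STATES:
                total += (0.25
                          * M[a][u][pattern[a]] * M[b][u][pattern[b]]
                          * Me[u][v]
                          * M[c][v][pattern[c]] * M[d][v][pattern[d]])
        dist[pattern] = total
    return dist


def mixture(weight, dist1, dist2):
    """Convex combination of two site-pattern distributions."""
    return {k: weight * dist1[k] + (1.0 - weight) * dist2[k] for k in dist1}


# Arbitrary illustrative parameters (not taken from the paper).
pendant1 = {0: 0.10, 1: 0.20, 2: 0.15, 3: 0.05}
pendant2 = {0: 0.12, 1: 0.08, 2: 0.20, 3: 0.10}
tree1 = quartet_distribution(((0, 1), (2, 3)), pendant1, internal=0.10)
tree2 = quartet_distribution(((0, 2), (1, 3)), pendant2, internal=0.05)
mix = mixture(0.6, tree1, tree2)
```

In these terms, identifiability asks whether the two topologies, the edge parameters, and the mixing weight can be recovered from a distribution such as `mix`, up to swapping the two components. The sketch only builds the distribution; it does not test identifiability.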
