Phylogeny of Mixture Models: Robustness of Maximum Likelihood and Non-Identifiable Distributions

We address phylogenetic reconstruction when the data are generated from a mixture distribution. This setting has gained considerable attention in the biological community, given the clear evidence that mutation rates are heterogeneous. In our work we consider data coming from a mixture of trees which share a common topology but differ in their edge weights (i.e., branch lengths). We first show the pitfalls of popular methods, including maximum likelihood and Markov chain Monte Carlo algorithms. We then determine in which evolutionary models reconstructing the tree topology under a mixture distribution is possible and in which it is impossible. We prove that for every model whose transition matrices can be parameterized by an open set of multilinear polynomials, exactly one of two cases holds: either the model has non-identifiable mixture distributions, in which case reconstruction is impossible in general, or there exist linear tests which identify the topology. This duality theorem relies on our notion of linear tests and uses ideas from convex programming duality. Linear tests are closely related to linear invariants, which were first introduced by Lake, and are natural from an algebraic geometry perspective.
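To make the setting concrete, the following is a minimal sketch of a mixture distribution over site patterns, where every component shares one topology and only the edge weights differ. It assumes, purely for illustration, a two-state symmetric model on the four-taxon tree ((a,b),(c,d)), with each edge described by a flip probability. The function names `pattern_distribution` and `mixture` and the example edge weights are hypothetical and are not taken from the paper.

```python
from itertools import product


def pattern_distribution(p_edges):
    """Site-pattern distribution on the quartet ((a,b),(c,d)).

    Assumed model: two states, symmetric substitutions, uniform root.
    p_edges = (pa, pb, pc, pd, pint) gives the flip probability on the
    four pendant edges and on the internal edge.
    """
    pa, pb, pc, pd, pint = p_edges

    def step(x, y, p):
        # Probability of observing state y at the end of an edge that
        # starts in state x and flips with probability p.
        return p if x != y else 1.0 - p

    dist = {}
    for a, b, c, d in product((0, 1), repeat=4):
        total = 0.0
        # Sum over the hidden states u, v at the two internal nodes.
        for u, v in product((0, 1), repeat=2):
            total += (0.5 * step(u, a, pa) * step(u, b, pb)
                      * step(u, v, pint)
                      * step(v, c, pc) * step(v, d, pd))
        dist[(a, b, c, d)] = total
    return dist


def mixture(weights, components):
    """Convex combination of site-pattern distributions."""
    return {k: sum(w * comp[k] for w, comp in zip(weights, components))
            for k in components[0]}


# Two components on the same topology with different edge weights.
d1 = pattern_distribution((0.40, 0.05, 0.40, 0.05, 0.05))
d2 = pattern_distribution((0.05, 0.40, 0.05, 0.40, 0.05))
mix = mixture((0.5, 0.5), (d1, d2))
```

A reconstruction method sees only samples from `mix`, not from `d1` or `d2` separately. In this picture, a linear test would be a linear function of the sixteen pattern probabilities whose sign separates the mixtures arising from one topology from those arising from another.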
