Molecular clock fork phylogenies: closed form analytic maximum likelihood solutions.

Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees, but finding the global optimum is a hard computational task. Because no general analytic solution is known, numeric techniques such as hill climbing or expectation maximization (EM) are used to find optimal parameters for a given tree. So far, analytic solutions have been derived only for the simplest model: three taxa, two-state characters, under a molecular clock. Quoting Ziheng Yang, who initiated the analytic approach, "this seems to be the simplest case, but has many of the conceptual and statistical complexities involved in phylogenetic estimation." In this work, we give general analytic solutions for a family of trees with four taxa, two-state characters, under a molecular clock. The change from three to four taxa incurs a major increase in the complexity of the underlying algebraic system, and requires novel techniques and approaches. We start by presenting the general maximum likelihood problem on phylogenetic trees as a constrained optimization problem, together with the resulting system of polynomial equations. Solving this system in full generality is infeasible, so we develop specialized tools for the molecular clock case. Four-taxa rooted trees have two topologies: the fork (two subtrees with two leaves each) and the comb (one subtree with three leaves, the other with a single leaf). We combine the ultrametric properties of molecular clock fork trees with the Hadamard conjugation to derive a number of topology-dependent identities. Employing these identities, we substantially simplify the system of polynomial equations for the fork. Finally, we employ symbolic algebra software to obtain closed-form analytic solutions (expressed parametrically in the input data). In general, four-taxa trees can have multiple ML points. In contrast, we can now prove that each fork topology has a unique (local and global) ML point.
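To make the objective concrete, the following is a minimal Python sketch of the likelihood function for a four-taxa fork tree with two-state characters under a molecular clock. It is an illustration under stated assumptions, not the method of the paper: it sums over internal node states by brute force rather than using the Hadamard conjugation, and it performs no optimization. The parameter names `a`, `b`, and `h` (heights of the two cherries and of the root), the function names, and the rate normalization p(t) = (1 - exp(-2t))/2 for the symmetric two-state model are all assumptions made for the example.

```python
import itertools
import math


def change_prob(t):
    """Probability that the two ends of an edge of length t differ.

    Assumes the symmetric two-state model with the rate scaled so that
    p(t) = (1 - exp(-2t)) / 2 (an assumed normalization).
    """
    return 0.5 * (1.0 - math.exp(-2.0 * t))


def edge_factor(p, x, y):
    """Transition probability along an edge with change probability p."""
    return 1.0 - p if x == y else p


def fork_pattern_probs(a, b, h):
    """Site pattern probabilities for the fork ((1,2),(3,4)) under a clock.

    a: height of the common ancestor of leaves 1 and 2 (hypothetical name)
    b: height of the common ancestor of leaves 3 and 4 (hypothetical name)
    h: height of the root, with 0 <= a <= h and 0 <= b <= h
    All leaves sit at height 0, so every root-to-leaf path has length h.
    """
    if not (0.0 <= a <= h and 0.0 <= b <= h):
        raise ValueError("clock constraint violated: need 0 <= a, b <= h")
    pa, pb = change_prob(a), change_prob(b)          # pendant edges
    pu, pv = change_prob(h - a), change_prob(h - b)  # root to internal nodes
    probs = {}
    for leaves in itertools.product((0, 1), repeat=4):
        total = 0.0
        # Sum over the states of the root r and the internal nodes u, v.
        for r, u, v in itertools.product((0, 1), repeat=3):
            total += (
                0.5  # uniform root distribution
                * edge_factor(pu, r, u)
                * edge_factor(pv, r, v)
                * edge_factor(pa, u, leaves[0])
                * edge_factor(pa, u, leaves[1])
                * edge_factor(pb, v, leaves[2])
                * edge_factor(pb, v, leaves[3])
            )
        probs[leaves] = total
    return probs


def log_likelihood(counts, a, b, h):
    """Log-likelihood of observed pattern counts (dict: pattern -> count)."""
    probs = fork_pattern_probs(a, b, h)
    return sum(n * math.log(probs[pat]) for pat, n in counts.items() if n > 0)
```

In this sketch, the ML problem described above corresponds to maximizing `log_likelihood` over `a`, `b`, and `h` subject to the clock constraint; a numeric optimizer applied to this function is the kind of hill-climbing approach that the analytic solutions are meant to replace.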
