A Structural EM Algorithm for Phylogenetic Inference

A central task in the study of molecular evolution is the reconstruction of a phylogenetic tree from sequences of current-day taxa. The most established approach to tree reconstruction is maximum likelihood (ML) analysis. Unfortunately, searching for the maximum likelihood phylogenetic tree is computationally prohibitive for large data sets. In this paper, we describe a new algorithm that uses Structural Expectation Maximization (EM) to learn maximum likelihood phylogenetic trees. The algorithm is similar to the standard EM method for edge-length estimation, except that each iteration of Structural EM improves the topology as well as the edge lengths. Our algorithm alternates between two steps. In the E-step, we use the current tree topology and edge lengths to compute expected sufficient statistics, which summarize the data. In the M-step, we search for a topology that maximizes the likelihood with respect to these expected sufficient statistics. We show that, in contrast to standard methods for topology search, the search for better topologies inside the M-step can be done efficiently. We prove that each iteration of this procedure increases the likelihood of the topology, and thus the procedure must converge. The convergence point, however, can be suboptimal. To escape from such "local optima," we further enhance our basic EM procedure by incorporating moves in the spirit of simulated annealing. We evaluate these new algorithms on both synthetic and real sequence data and show that, for protein sequences, even our basic algorithm finds more plausible trees than existing methods for searching for maximum likelihood phylogenies. Furthermore, our algorithms are dramatically faster than such methods, enabling, for the first time, phylogenetic analysis of large protein data sets in the maximum likelihood framework.
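The abstract does not say how the M-step search is made efficient. As an illustration only, the sketch below assumes that the expected sufficient statistics reduce to one score per candidate edge, so that the best-scoring tree is a maximum-weight spanning tree, found here with Kruskal's algorithm. The function name `max_spanning_tree` and the `scores` matrix are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def max_spanning_tree(weights: List[List[float]]) -> List[Tuple[int, int]]:
    """Return the edges of a maximum-weight spanning tree (Kruskal's algorithm).

    `weights[i][j]` is an ASSUMED symmetric per-edge score, standing in for a
    value computed from expected sufficient statistics in an E-step.
    """
    n = len(weights)
    parent = list(range(n))  # union-find forest over the n nodes

    def find(x: int) -> int:
        # Find the component representative, halving paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider every unordered pair once, highest score first.
    edges = sorted(
        ((weights[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )

    tree: List[Tuple[int, int]] = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so it adds no cycle
            parent[ri] = rj
            tree.append((i, j))
            if len(tree) == n - 1:
                break
    return tree


# Hypothetical scores for four nodes; larger means a better-supported edge.
scores = [
    [0.0, 5.0, 1.0, 1.0],
    [5.0, 0.0, 4.0, 1.0],
    [1.0, 4.0, 0.0, 3.0],
    [1.0, 1.0, 3.0, 0.0],
]
tree = max_spanning_tree(scores)
```

Under this assumption the topology search costs roughly the time needed to sort all node pairs, rather than one likelihood evaluation per candidate rearrangement. A full Structural EM loop would recompute the scores in the next E-step and repeat until the likelihood stops improving.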
