Maximum likelihood estimation of phylogenetic tree and substitution rates via generalized neighbor-joining and the EM algorithm

A central task in the study of molecular sequence data from present-day species is the reconstruction of their ancestral relationships. The most established approach to tree reconstruction is the maximum likelihood (ML) method, in which evolution is described as a discrete-state, continuous-time Markov process on a phylogenetic tree. The substitution rate matrix that determines the Markov process can be estimated using the expectation-maximization (EM) algorithm. Unfortunately, an exhaustive search for the ML phylogenetic tree is computationally prohibitive for large data sets. In such situations, the neighbor-joining (NJ) method is frequently used because of its computational speed. The NJ method reconstructs trees by recursively clustering neighboring sequences, based on pairwise comparisons between the sequences. It can be generalized so that reconstruction is based on comparisons of subtrees rather than on pairwise distances. In this paper, we present an algorithm for simultaneous substitution rate estimation and phylogenetic tree reconstruction. The algorithm alternates between the EM algorithm for estimating substitution rates and the generalized NJ method for tree reconstruction. Preliminary results of the approach are encouraging.
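To make the recursive clustering step concrete, the following is a minimal sketch of the classic pairwise NJ method in Python. It is not the generalized, subtree-based variant used in this paper, and it does not include the EM step. It assumes the standard selection criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from node i to all other nodes. The function name `neighbor_joining`, the internal node labels `u0`, `u1`, ... and the four-taxon distances are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Classic pairwise neighbor-joining (illustrative sketch).

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (node, node, branch_length) edges of an unrooted tree.
    """
    nodes = list(names)
    d = dict(dist)          # copy, so the caller's distances are not modified
    edges = []
    next_id = 0             # counter for hypothetical internal node labels
    while len(nodes) > 2:
        n = len(nodes)
        # r[i] is the sum of distances from i to every other active node.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}
        # Join the pair that minimizes Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from the new internal node u to i and to j.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"u{next_id}"
        next_id += 1
        edges += [(u, i, li), (u, j, lj)]
        # Distances from u to each remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    # Connect the last two nodes with their remaining distance.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Made-up additive distances for the tree ((A:2, B:3):1, (C:4, D:5)).
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 5.0, frozenset(("A", "C")): 7.0,
    frozenset(("A", "D")): 8.0, frozenset(("B", "C")): 8.0,
    frozenset(("B", "D")): 9.0, frozenset(("C", "D")): 9.0,
}
edges = neighbor_joining(names, dist)
for a, b, length in edges:
    print(a, b, length)
```

Because the example distances are additive (they come exactly from a tree), NJ recovers that tree and its branch lengths. With distances estimated from real sequences, the result is only an approximation, which is one motivation for combining NJ-style reconstruction with likelihood-based rate estimation.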
