Evolution can be mathematically modelled by a stochastic process that operates on the DNA of species. Such models are based on the established theory that the DNA sequences, or genomes, of all extant species have been derived from the genome of the common ancestor of all species by a process of random mutation and natural selection. A stochastic model of evolution can be used to construct phylogenies, or evolutionary trees, for a set of species. Maximum Likelihood Estimation (MLE) methods seek the evolutionary tree that is most likely to have produced the DNA under consideration. While these methods are intellectually satisfying, they have not been widely accepted because of their computational intractability. In this paper, we address the intractability of MLE methods as follows. We introduce a metric on stochastic process models of evolution. We show that this metric is meaningful by proving that any algorithm needs many observations to distinguish between two stochastic models that are close according to this metric. We complement this result with a simple and efficient algorithm for inverting the stochastic process of evolution, that is, for building a tree from observations on two-state characters. (We have used the same techniques in a subsequent paper to solve the problem for multistate characters, and hence for building a tree from DNA sequence data.) The tree we build is provably close, in our metric, to the tree generating the data, and it gets closer as more observations become available. Though many heuristics have been suggested for the problem of finding good approximations to the most likely tree, our algorithm is the first one with a guaranteed convergence rate; further, this rate is within a polynomial of the lower-bound rate we establish. Ours is also the first polynomial-time algorithm that is proven to converge at all to the correct tree.
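To make the phrase "observations on two-state characters" concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes a symmetric two-state model in which each tree edge independently flips a character with some probability; the abstract does not specify the model, and the names `path_flip_probability`, `additive_distance`, and `simulate_disagreement` are hypothetical. Under that assumption, the probability that two species disagree on a character depends only on the edges of the path between them, and a logarithmic transform of that probability is additive along the path, which is the kind of quantity that can be estimated from more and more observations.

```python
import math
import random


def path_flip_probability(edge_flips):
    """Probability that a two-state character differs at the two ends of a
    path whose edges flip the state independently with the given
    probabilities (assumed symmetric two-state model)."""
    product = 1.0
    for p in edge_flips:
        product *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - product)


def additive_distance(disagreement):
    """Transform a disagreement probability below 1/2 into a distance that
    adds up along paths under the assumed model."""
    return -0.5 * math.log(1.0 - 2.0 * disagreement)


def simulate_disagreement(edge_flips, n_characters, rng):
    """Estimate the disagreement probability from n_characters independent
    simulated characters evolving along the path."""
    differing = 0
    for _ in range(n_characters):
        flipped = False
        for p in edge_flips:
            if rng.random() < p:
                flipped = not flipped
        if flipped:
            differing += 1
    return differing / n_characters


if __name__ == "__main__":
    edges = [0.1, 0.2]  # hypothetical flip probabilities on a two-edge path
    exact = path_flip_probability(edges)
    estimate = simulate_disagreement(edges, 20000, random.Random(0))
    print("exact disagreement:", exact)
    print("estimated disagreement:", estimate)
    print("additive distance:", additive_distance(exact))
```

With more simulated characters the estimate concentrates around the exact value, which illustrates, in this toy setting only, why the quality of a reconstructed tree can improve as observations accumulate.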
Rutgers University; farach@cs.rutgers.edu; http://www.cs.rutgers.edu/~farach; Supported by an NSF Career Advancement Award and an Alfred P. Sloan Research Fellowship.
University of Pennsylvania; kannan@central.cis.upenn.edu; http://www.cis.upenn.edu/~kannan/home.html; Supported by NSF CCR 96-19910 and NSF SGER 9612829.