Tutorial on Phylogenetic Tree Estimation

1 Tutorial Summary

All biological disciplines are united by the idea that species share a common history. The genealogical history of life, also called an "evolutionary tree", is usually represented by a bifurcating, leaf-labeled tree. The use of evolutionary trees is a fundamental step in many biological problems, such as multiple sequence alignment, protein structure and function prediction, and drug design. The primary scientific objective of phylogenetic studies is not to solve a given optimization problem, but rather to recover the order of speciation or gene duplication events represented by the topology of the true evolutionary tree. (Locating the root of the evolutionary tree is a scientifically difficult task, so a method is considered successful if it recovers the topology of the unrooted tree.) This means that good or poor performance with respect to optimization problems is only important to the degree that it guarantees good or poor performance with respect to topology estimation.

Unfortunately, inferring evolutionary trees is an enormously difficult problem for several reasons. For one, the phylogeny problem is a difficult statistical problem because its parameter space has a complicated structure, and there is no "off the shelf" solution that can be applied. The phylogeny problem also presents a considerable computational challenge. Typical data sets now consist of several hundred species, and presently available tree reconstruction methods are inadequate to the task of analyzing such datasets. For example, an rbcL DNA sequence data set of 500 plants has been analyzed for several years now, without solution. The explanation for why these analyses are so difficult is simple: the optimization problems are NP-hard, and the heuristics used in an attempt to solve them rely on hill-climbing techniques to search through an exponentially large space of phylogenetic trees.
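To give a rough sense of how large this search space is, the sketch below counts the unrooted, bifurcating, leaf-labeled trees on a given number of species. It relies on the standard combinatorial count (2n-5)!! for n >= 3 leaves, which is not stated in the summary itself; the function name `num_unrooted_binary_trees` is a hypothetical helper introduced only for illustration.

```python
def num_unrooted_binary_trees(n_leaves: int) -> int:
    """Count unrooted, bifurcating, leaf-labeled trees on n_leaves leaves.

    Assumes the standard double-factorial formula (2n - 5)!! for n >= 3.
    """
    if n_leaves < 3:
        raise ValueError("the formula is used here only for n_leaves >= 3")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 5 (empty product for n = 3).
    for factor in range(3, 2 * n_leaves - 4, 2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 50, 500):
        trees = num_unrooted_binary_trees(n)
        # Report the number of decimal digits, since the counts grow very fast.
        print(f"{n} leaves: a count with {len(str(trees))} decimal digits")
```

Even for a few dozen species the count is far beyond exhaustive enumeration, which is consistent with the reliance on heuristic search described above.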
Statistical approaches to phylogeny reconstruction have modeled the evolutionary process stochastically, and have studied the performance of methods for recovering phylogenetic trees in terms of their accuracy on datasets of finite-length sequences generated under different model trees. These studies have shown that some methods recover the true tree topology with high probability once the sequences are long enough, while other methods have no such guarantees. Over the last decade or so, computer scientists have also begun to design and analyze the performance of phylogenetic methods under these statistical models. One of the results of this interest in using statistical models of …

[117]  Dan Gusfield,et al.  Efficient algorithms for inferring evolutionary trees , 1991, Networks.