On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled.

Using analytical methods, we show that under a variety of model misspecifications, Neighbor-Joining, minimum evolution, and least squares estimation procedures are statistically inconsistent. Failure to correctly account for differing rates-across-sites processes, failure to correctly model rate matrix parameters, and failure to adjust for parallel rates-across-sites changes (a rates-across-subtrees process) are all shown to lead to a "long branch attraction" form of inconsistency. In addition, failure to account for rates-across-sites processes is also shown to result in underestimation of evolutionary distances for a wide variety of substitution models, generalizing an earlier analytical result for the Jukes-Cantor model reported in Golding and a similar bias result for the GTR or REV model in Kelly and Rice (1996). Although standard rates-across-sites models can be employed in many of these cases to restore consistency, current models cannot account for other kinds of misspecification. We examine an idealized but biologically relevant case, where parallel changes in rates at sites across subtrees is shown to give rise to inconsistency. This changing rates-across-subtrees type model misspecification cannot be adjusted for with conventional methods or without carefully considering the rate variation in the larger tree. The results are presented for four-taxon trees, but the expectation is that they have implications for larger trees as well. To illustrate this, a simulated 42-taxon example is given in which the microsporidia, an enigmatic group of eukaryotes, are incorrectly placed at the archaebacteria-eukaryotes split because of incorrectly specified pairwise distances. The analytical nature of the results lend insight into the reasons that long branch attraction tends to be a common form of inconsistency and reasons that other forms of inconsistency like "long branches repel" can arise in some settings. In many of the cases of inconsistency presented, a particular incorrect topology is estimated with probability converging to one, the implication being that measures of uncertainty like bootstrap support will be unable to detect that there is a problem with the estimation. The focus is on distance methods, but previous simulation results suggest that the zones of inconsistency for distance methods contain the zones of inconsistency for maximum likelihood methods as well.

[1]  J. Rice,et al.  Modeling nucleotide evolution: a heterogeneous rate analysis. , 1996, Mathematical biosciences.

[2]  Joseph T. Chang,et al.  Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. , 1996, Mathematical biosciences.

[3]  Y. Inagaki,et al.  Evolution of the eukaryotic translation termination system: origins of release factors. , 2000, Molecular biology and evolution.

[4]  W. Doolittle,et al.  Phylogenetic analysis of the TATA box binding protein (TBP) gene from Nosema locustae: evidence for a microsporidia-fungi relationship and spliceosomal intron loss. , 1999, Molecular biology and evolution.

[5]  H. Kishino,et al.  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA , 2005, Journal of Molecular Evolution.

[6]  H. Munro,et al.  Mammalian protein metabolism , 1964 .

[7]  Fabienne Thomarat,et al.  Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi , 2001, Nature.

[8]  Y. Peer,et al.  Microsporidia: accumulating molecular evidence that a group of amitochondriate and suspectedly primitive eukaryotes are just curious fungi. , 2000, Gene.

[9]  W. Doolittle,et al.  Alpha-tubulin from early-diverging eukaryotic lineages and the evolution of the tubulin family. , 1996, Molecular biology and evolution.

[10]  Olivier Gascuel,et al.  Concerning the NJ algorithm and its unweighted version, UNJ , 1996, Mathematical Hierarchies and Biology.

[11]  Edward Susko,et al.  Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1alpha phylogenies. , 2004, Molecular biology and evolution.

[12]  M. Steel,et al.  How molecules evolve in eubacteria. , 2000, Molecular biology and evolution.

[13]  J. Huelsenbeck Performance of Phylogenetic Methods in Simulation , 1995 .

[14]  M. Kimura A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences , 1980, Journal of Molecular Evolution.

[15]  W. Bruno,et al.  Topological bias and inconsistency of maximum likelihood using wrong models. , 1999, Molecular biology and evolution.

[16]  J. Felsenstein Cases in which Parsimony or Compatibility Methods will be Positively Misleading , 1978 .

[17]  R. Debry,et al.  The consistency of several phylogeny-inference methods under varying evolutionary rates. , 1992, Molecular biology and evolution.

[18]  M. Steel,et al.  Modeling the covarion hypothesis of nucleotide substitution. , 1998, Mathematical biosciences.

[19]  J. Felsenstein,et al.  A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. , 1994, Molecular biology and evolution.

[20]  W. Fitch,et al.  An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution , 1970, Biochemical Genetics.

[21]  M Steel,et al.  Invariable sites models and their use in phylogeny reconstruction. , 2000, Systematic biology.

[22]  M. Nei,et al.  Theoretical foundation of the minimum-evolution method of phylogenetic inference. , 1993, Molecular biology and evolution.

[23]  O Gascuel,et al.  Strengths and limitations of the minimum evolution principle. , 2001, Systematic biology.

[24]  Michael D. Hendy,et al.  A Framework for the Quantitative Study of Evolutionary Trees , 1989 .

[25]  P. Meisel Margaret O. Dayhoff: Atlas of Protein Sequence and Structure 1969 (Volume 4) XXIV u. 361 S., 21 Ausklapptafeln, 68 Abb. und zahlreiche Tabellen. National Biomedical Research Foundation, Silver Spring/Maryland 1969. Preis $ 12,50 , 1971 .

[26]  K. Kidd,et al.  Phylogenetic analysis: concepts and methods. , 1971, American journal of human genetics.

[27]  M. Nei Molecular Evolutionary Genetics , 1987 .

[28]  W. Doolittle,et al.  Microsporidia are related to Fungi: evidence from the largest subunit of RNA polymerase II and other proteins. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[29]  M. Nei,et al.  Infinite allele model with varying mutation rate. , 1976, Proceedings of the National Academy of Sciences of the United States of America.

[30]  G. B. Golding,et al.  Estimates of DNA and protein sequence divergence: an examination of some assumptions. , 1983, Molecular biology and evolution.

[31]  M. Steel,et al.  General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. , 1997, Molecular phylogenetics and evolution.

[32]  L. Cavalli-Sforza,et al.  PHYLOGENETIC ANALYSIS: MODELS AND ESTIMATION PROCEDURES , 1967, Evolution; international journal of organic evolution.

[33]  W. Doolittle,et al.  A kingdom-level phylogeny of eukaryotes based on combined protein data. , 2000, Science.

[34]  Thomas Uzzell,et al.  Fitting Discrete Probability Distributions to Evolutionary Events , 1971, Science.

[35]  P. Lewis,et al.  Success of maximum likelihood phylogeny inference in the four-taxon case. , 1995, Molecular biology and evolution.

[36]  W. Li,et al.  A general additive distance with time-reversibility and rate variation among nucleotide sites. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[37]  A. El'skaya,et al.  Eukaryotic translation elongation factor 1 alpha: structure, expression, functions, and possible role in aminoacyl-tRNA channeling. , 1998, Progress in nucleic acid research and molecular biology.

[38]  Ziheng Yang Estimating the pattern of nucleotide substitution , 1994, Journal of Molecular Evolution.

[39]  A Rzhetsky,et al.  When is it safe to use an oversimplified substitution model in tree-making? , 1996, Molecular biology and evolution.

[40]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.