Assessment of protein distance measures and tree-building methods for phylogenetic tree reconstruction.

Distance-based methods are popular for reconstructing evolutionary trees of protein sequences, mainly because of their speed and generality. A number of variants of the classical neighbor-joining (NJ) algorithm have been proposed, as well as a number of methods for estimating protein distances. Here we present a large-scale assessment of how well the most popular algorithms reconstruct the correct tree topology. The programs BIONJ, FastME, Weighbor, and standard NJ were run with 12 distance estimators, giving 48 combinations of tree-building and distance estimation methods. These were evaluated on a test set based on real trees taken from 100 Pfam families. Each tree was used to generate multiple sequence alignments with the ROSE program under three evolutionary models. The accuracy of each method was analyzed as a function of both sequence divergence and location in the tree. We found that BIONJ produced the best results overall, although average accuracy differed little between the tree-building methods (typically by less than 1%). A noticeable trend was that FastME performed worse than the other methods on long branches. Weighbor was several orders of magnitude slower than the other programs. Larger differences were observed between distance estimators. The protein-adapted Jukes-Cantor and Kimura distance corrections produced clearly worse results than the other estimators, worse even than uncorrected distances. We also assessed the recently developed Scoredist measure, which performed as well as more complex methods.
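
To make the simplest distance estimators named above concrete, the following minimal Python sketch computes an uncorrected distance (the fraction of differing residues) and applies the textbook forms of the protein-adapted Jukes-Cantor and Kimura corrections. The abstract does not give the formulas used in the study, so the expressions here are the commonly cited ones and the exact implementations assessed may differ. The function names and the gap-handling rule are assumptions made for illustration only.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Uncorrected distance: fraction of differing residues.

    Assumption: columns where either sequence has a gap are skipped.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_protein(p, states=20):
    """Jukes-Cantor correction generalised to 20 amino-acid states:
    d = -(19/20) * ln(1 - (20/19) * p)
    """
    b = (states - 1) / states
    if p >= b:
        return math.inf  # saturated: the correction is undefined
    return -b * math.log(1 - p / b)


def kimura_protein(p):
    """Kimura's empirical protein correction:
    d = -ln(1 - p - 0.2 * p^2)
    """
    arg = 1 - p - 0.2 * p * p
    if arg <= 0:
        return math.inf  # saturated: the correction is undefined
    return -math.log(arg)


if __name__ == "__main__":
    # Toy aligned sequences, invented for illustration only.
    p = p_distance("ACDE-FGH", "ACDQAFGY")
    print("uncorrected:", p)
    print("Jukes-Cantor (protein):", jukes_cantor_protein(p))
    print("Kimura (protein):", kimura_protein(p))
```

Both corrections inflate the observed distance to account for multiple substitutions at the same site, and both become undefined at high divergence, which the sketch reports as infinity.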
