Detecting and overcoming systematic errors in genome-scale phylogenies.

Genome-scale data sets enhance the resolution of phylogenetic inference by reducing stochastic errors. However, they also increase the impact of systematic errors caused by model violations, which can lead to erroneous phylogenies. Here, we explore the impact of systematic errors on the resolution of the eukaryotic phylogeny using a data set of 143 nuclear-encoded proteins from 37 species. Our initial observation was that, despite the impressive amount of data, some branches had no significant statistical support. To demonstrate that this lack of resolution is due to a mutual annihilation of phylogenetic and nonphylogenetic signals, we created a series of data sets with slightly different taxon sampling. As expected, these data sets yielded strongly supported but mutually exclusive trees, confirming the presence of conflicting phylogenetic and nonphylogenetic signals in the original data set. To decide on the correct tree, we applied several methods expected to reduce the impact of some kinds of systematic error. Briefly, we show that (i) removing fast-evolving positions, (ii) recoding amino acids into functional categories, and (iii) using a site-heterogeneous mixture model (CAT) are three effective means of increasing the ratio of phylogenetic to nonphylogenetic signal. Finally, our results allow us to formulate guidelines for detecting and overcoming phylogenetic artefacts in genome-scale phylogenetic analyses.
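As an illustration of approaches (i) and (ii), the following minimal Python sketch recodes an amino-acid alignment into categories and drops highly variable columns. It makes several assumptions that the abstract does not state: the six Dayhoff groups stand in for the "functional categories", the number of distinct states per column is a crude proxy for evolutionary rate, and the function names (`recode_alignment`, `drop_variable_sites`) and the `max_states` threshold are hypothetical. It is not the procedure used in the study.

```python
# Assumed recoding scheme: the six classical Dayhoff groups.
# The abstract only says "functional categories"; this choice is illustrative.
DAYHOFF6 = {"AGPST": "1", "DENQ": "2", "HKR": "3",
            "ILMV": "4", "FWY": "5", "C": "6"}
RECODE = {aa: code for group, code in DAYHOFF6.items() for aa in group}

# Characters treated as missing data (gap, unknown, ambiguous residue).
MISSING = set("-?X")


def recode_alignment(aln):
    """Map every residue to its category; leave other characters unchanged."""
    return {
        name: "".join(RECODE.get(aa, aa) for aa in seq.upper())
        for name, seq in aln.items()
    }


def drop_variable_sites(aln, max_states):
    """Keep columns with at most `max_states` distinct non-missing states.

    The state count is a crude stand-in for site rate (an assumption);
    a real analysis would estimate rates on a tree or under a model.
    """
    names = list(aln)
    length = len(aln[names[0]])
    keep = []
    for i in range(length):
        states = {aln[n][i] for n in names} - MISSING
        if len(states) <= max_states:
            keep.append(i)
    return {n: "".join(aln[n][i] for i in keep) for n in names}


if __name__ == "__main__":
    toy = {"sp1": "ACDK-", "sp2": "GCEK-", "sp3": "ICFRX"}
    print(recode_alignment(toy))
    print(drop_variable_sites(toy, max_states=2))
```

Recoding collapses substitutions within a category, which tend to be frequent and saturated, while column filtering removes the positions most likely to carry multiple hits. Both aim to raise the ratio of phylogenetic to nonphylogenetic signal at the cost of discarding some information.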
