The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated.

The effect of compositional heterogeneity in sequence data on phylogenetic inference was first identified as a potential problem in the late 1980s and early 1990s (Chang and Campbell, 2000; Conant and Lewis, 2001; Foster and Hickey, 1999; Hasegawa et al., 1993; Klenk et al., 1994; Lockhart et al., 1992a, 1992b; Loomis and Smith, 1990; Olsen and Woese, 1993; Penny et al., 1990; Sogin et al., 1993; Tarrio et al., 2001; Van Den Bussche et al., 1998; Weisburg et al., 1989), and by 1993 the first methods had been developed to measure the extent of the problem (Lockhart et al., 1993, 1994; Steel et al., 1993, 1995) or to overcome it (Foster, 2004; Galtier and Gouy, 1995, 1998; Galtier et al., 1999; Gu and Li, 1996, 1998; Lake, 1994; Steel, 1994; Steel et al., 1993, 1995; Tamura and Kumar, 2002; Yang and Roberts, 1995). It is now widely accepted that compositional heterogeneity in aligned sequence data can mislead methods commonly used to infer phylogenetic trees, but it is still unclear (i) why phylogenetic studies based on the LogDet (or paralinear) distance (Lockhart et al., 1994; Steel, 1994) sometimes fail to recover the expected tree topology from compositionally heterogeneous alignments (e.g., Foster and Hickey, 1999; Tarrio et al., 2001), and (ii) how much compositional convergence is necessary before the phylogenetic methods fail to recover the correct topology. Using Monte Carlo simulations to address the second point, Conant and Lewis (2001) concluded that “rather extreme amounts of convergence are necessary before parsimony begins to prefer the incorrect tree.” Other simulation studies have reached similar conclusions (e.g., Galtier and Gouy, 1995; Rosenberg and Kumar, 2003; Van Den Bussche et al., 1998). Based on the study by Galtier and Gouy (1995), it would appear that it is safe to use DNA for phylogenetic inference as long as the difference in GC content is less than 8% to 10%. This im-

[1]  M. Steel. Recovering a tree from the leaf colourations it generates under a Markov model. 1994.

[2]  D. Penny, et al. A frequency-dependent significance test for parsimony. 1995, Molecular Phylogenetics and Evolution.

[3]  A. von Haeseler, et al. Distance measures in terms of substitution processes. 1999, Theoretical Population Biology.

[4]  W. Li, et al. Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution. 1998, Proceedings of the National Academy of Sciences of the United States of America.

[5]  Faisal Ababneh, et al. Hetero: a program to simulate the evolution of DNA on a four-taxon tree. 2003, Applied Bioinformatics.

[6]  C. Saccone, et al. A simple quantitative model of the molecular clock. 2005, Journal of Molecular Evolution.

[7]  J. Huelsenbeck, et al. Base compositional bias and phylogenetic analyses: a test of the "flying DNA" hypothesis. 1998, Molecular Phylogenetics and Evolution.

[8]  F. Ayala, et al. Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae. 2001, Molecular Biology and Evolution.

[9]  P. J. Waddell, et al. Using novel phylogenetic methods to evaluate mammalian mtDNA, including amino acid-invariant sites-LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the positions of hedgehog, armadillo, and elephant. 1999, Systematic Biology.

[10]  C. Woese, et al. The Deinococcus-Thermus phylum and the effect of rRNA composition on phylogenetic tree construction. 1989, Systematic and Applied Microbiology.

[11]  D. Penny, et al. Trees from sequences: panacea or Pandora's box. 1990.

[12]  M. Sogin, et al. Universal tree of life. 1993, Nature.

[13]  A. Bowker, et al. A test for symmetry in contingency tables. 1948, Journal of the American Statistical Association.

[14]  S. Kumar, et al. Disparity index: a simple statistic to measure and test the homogeneity of substitution patterns between molecular sequences. 2001, Genetics.

[15]  Sudhir Kumar, et al. Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference. 2003, Molecular Biology and Evolution.

[16]  J. Lake, et al. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. 1994, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Erratum: Disparity index: A simple statistic to measure and test the homogeneity of substitution patterns between molecular sequences (Genetics (158) (1321-1327)). 2001.

[18]  G. Serio, et al. A new method for calculating evolutionary substitution rates. 2005, Journal of Molecular Evolution.

[19]  T. Jukes. Chapter 24 – Evolution of Protein Molecules. 1969.

[20]  J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. 2005, Journal of Molecular Evolution.

[21]  M. Steel, et al. Recovering evolutionary trees under a more realistic model of sequence evolution. 1994, Molecular Biology and Evolution.

[22]  K. Strimmer, et al. Quartet Puzzling: A Quartet Maximum-Likelihood Method for Reconstructing Tree Topologies. 1996.

[23]  A. Stuart. A test for homogeneity of the marginal distributions in a two-way classification. 1955.

[24]  A. Rzhetsky, et al. Tests of applicability of several substitution models for DNA sequence data. 1995, Molecular Biology and Evolution.

[25]  W. Fitch. Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology. 1971.

[26]  S. Ho, et al. Tracing the decay of the historical signal in biological sequence data. 2004, Systematic Biology.

[27]  G. Olsen, et al. Ribosomal RNA: a key to phylogeny. 1993, FASEB Journal: official publication of the Federation of American Societies for Experimental Biology.

[28]  W. F. Loomis, et al. Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison. 1990, Proceedings of the National Academy of Sciences of the United States of America.

[29]  C. Saccone, et al. Transition and transversion rate in the evolution of animal mitochondrial DNA. 1986, Bio Systems.

[30]  M. Gouy, et al. A nonhyperthermophilic common ancestor to extant life forms. 1999, Science.

[31]  A. Austin, et al. The evolution of strand-specific compositional bias. A case study in the Hymenopteran mitochondrial 16S rRNA gene. 1997, Molecular Biology and Evolution.

[32]  M. Steel, et al. General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. 1997, Molecular Phylogenetics and Evolution.

[33]  Peter G. Foster, et al. Modeling compositional heterogeneity. 2004, Systematic Biology.

[34]  M. Gouy, et al. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. 1998, Molecular Biology and Evolution.

[35]  Ziheng Yang, et al. Maximum-likelihood models for combined analyses of multiple sequence data. 1996, Journal of Molecular Evolution.

[36]  Michael P. Cummings, et al. PAUP* [Phylogenetic Analysis Using Parsimony (and Other Methods)]. 2004.

[37]  B. Chang, et al. Bias in phylogenetic reconstruction of vertebrate rhodopsin sequences. 2000, Molecular Biology and Evolution.

[38]  N. Saitou, et al. The neighbor-joining method: a new method for reconstructing phylogenetic trees. 1987, Molecular Biology and Evolution.

[39]  Z. Yang, et al. On the use of nucleic acid sequences to infer early branchings in the tree of life. 1995, Molecular Biology and Evolution.

[40]  John P. Huelsenbeck, et al. MRBAYES: Bayesian inference of phylogenetic trees. 2001, Bioinform.

[41]  P. Lewis, et al. Effects of nucleotide composition bias on the success of the parsimony criterion in phylogenetic inference. 2001, Molecular Biology and Evolution.

[42]  Sudhir Kumar, et al. Evolutionary distance estimation under heterogeneous substitution pattern among lineages. 2002, Molecular Biology and Evolution.

[43]  Michael D. Hendy, et al. Is Prochlorothrix hollandica the best choice as a prokaryotic model for higher plant Chl a/b photosynthesis? 1993, Photosynthesis Research.

[44]  M. Gouy, et al. Inferring phylogenies from DNA sequences of unequal base compositions. 1995, Proceedings of the National Academy of Sciences of the United States of America.

[45]  H. Klenk, et al. DNA-dependent RNA polymerases as phylogenetic marker molecules. 1993.

[46]  W. Li, et al. Bias-corrected paralinear and LogDet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. 1996, Molecular Biology and Evolution.

[47]  J. Adachi, et al. Phylogenetic place of mitochondrion-lacking protozoan, Giardia lamblia, inferred from amino acid sequences of elongation factor 2. 1995, Molecular Biology and Evolution.

[48]  R. Crozier, et al. Analysis of directional mutation pressure and nucleotide content in mitochondrial cytochrome b genes. 1994, Journal of Molecular Evolution.

[49]  D. Swofford. PAUP*: Phylogenetic analysis using parsimony (*and other methods), Version 4.0b10. 2002.

[50]  H. Munro, et al. Mammalian protein metabolism. 1964.

[51]  Z. Yang, et al. Among-site rate variation and its impact on phylogenetic analyses. 1996, Trends in Ecology & Evolution.

[52]  S. Tavaré. Some probabilistic and statistical problems in the analysis of DNA sequences. 1986.

[53]  M. Hasegawa, et al. Protein phylogeny gives a robust estimation for early divergences of eukaryotes: phylogenetic place of a mitochondria-lacking protozoan, Giardia lamblia. 1994, Molecular Biology and Evolution.

[54]  L. Jermiin, et al. Nucleotide Composition Bias Affects Amino Acid Content in Proteins Coded by Animal Mitochondria. 1997, Journal of Molecular Evolution.

[55]  P. Lockhart, et al. Substitutional bias confounds inference of cyanelle origins from sequence data. 1992, Journal of Molecular Evolution.

[56]  T. Miyata, et al. Early branchings in the evolution of eukaryotes: Ancient divergence of entamoeba that lacks mitochondria revealed by protein sequence data. 1993, Journal of Molecular Evolution.

[57]  D. Penny, et al. Controversy on chloroplast origins. 1992, FEBS Letters.

[58]  Martin Vingron, et al. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. 2002, Bioinform.

[59]  Peter G. Foster, et al. Compositional Bias May Affect Both DNA-Based and Protein-Based Phylogenetic Reconstructions. 1999, Journal of Molecular Evolution.

[60]  M. A. Steel, et al. Confidence in evolutionary trees from biological sequence data. 1993, Nature.