Phylogenetic signal and noise: predicting the power of a data set to resolve phylogeny.

A principal objective of phylogenetic experimental design is to predict the power of a data set to resolve nodes in a phylogenetic tree. However, proactively assessing the potential for phylogenetic noise compared with signal in a candidate data set has been a formidable challenge. Understanding how the collection of additional sequence data affects the resolution of recalcitrant internodes at diverse historical times will facilitate increasingly accurate and cost-effective phylogenetic research. Here, we derive theory based on the fundamental unit of the phylogenetic tree, the quartet, that applies estimates of the state space and the rates of evolution of characters in a data set to predict phylogenetic signal and phylogenetic noise, and therefore to predict the power to resolve internodes. We develop and implement a Monte Carlo approach to estimating the power to resolve, and we derive a nearly equivalent, faster deterministic calculation. These approaches are applied to describe the distribution of potential signal, polytomy, or noise for two example data sets, one recent (cytochrome c oxidase I and 28S ribosomal RNA sequences from Diplazontinae parasitoid wasps) and one deep (eight nuclear genes and a phylogenomic sequence for diverse microbial eukaryotes including Stramenopiles, Alveolata, and Rhizaria). The predicted power of resolution for the loci analyzed is consistent with the historic use of the genes in phylogenetics.
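The Monte Carlo idea described above can be illustrated with a minimal sketch. It is not the implementation from the paper, and every name and parameter in it is hypothetical. It assumes a symmetric k-state Poisson substitution model, a quartet ((A,B),(C,D)) with one internode of duration `t_internode` and four equal terminal branches of duration `t_terminal`, and a simple rule that labels a replicate as signal, polytomy, or noise by comparing counts of site patterns supporting the true and the two alternative quartet topologies.

```python
import math
import random


def evolve(state, rate, time, k, rng):
    """Evolve one character along a branch under a symmetric k-state Poisson model."""
    # Probability that the end state equals the start state (assumed model).
    p_same = 1.0 / k + (1.0 - 1.0 / k) * math.exp(-k / (k - 1.0) * rate * time)
    if rng.random() < p_same:
        return state
    # Otherwise move to one of the other k - 1 states, chosen uniformly.
    other = rng.randrange(k - 1)
    return other if other < state else other + 1


def simulate_site(rate, t_internode, t_terminal, k, rng):
    """Return 0 if the site supports AB|CD (true), 1 for AC|BD, 2 for AD|BC, else None."""
    x = rng.randrange(k)                       # state at the ancestor of A and B
    y = evolve(x, rate, t_internode, k, rng)   # state at the ancestor of C and D
    a = evolve(x, rate, t_terminal, k, rng)
    b = evolve(x, rate, t_terminal, k, rng)
    c = evolve(y, rate, t_terminal, k, rng)
    d = evolve(y, rate, t_terminal, k, rng)
    if a == b and c == d and a != c:
        return 0
    if a == c and b == d and a != b:
        return 1
    if a == d and b == c and a != b:
        return 2
    return None                                # uninformative for the quartet


def quartet_power(rates, t_internode, t_terminal, k=4, replicates=2000, seed=0):
    """Estimate the fractions of replicates classified as signal, polytomy, or noise."""
    rng = random.Random(seed)
    tally = {"signal": 0, "polytomy": 0, "noise": 0}
    for _ in range(replicates):
        counts = [0, 0, 0]
        for rate in rates:
            pattern = simulate_site(rate, t_internode, t_terminal, k, rng)
            if pattern is not None:
                counts[pattern] += 1
        best_wrong = max(counts[1], counts[2])
        if counts[0] > best_wrong:
            tally["signal"] += 1       # true topology has the most support
        elif counts[0] < best_wrong:
            tally["noise"] += 1        # an incorrect topology has the most support
        else:
            tally["polytomy"] += 1     # tie, including no informative sites
    return {name: n / replicates for name, n in tally.items()}


if __name__ == "__main__":
    # Hypothetical per-site rates; in practice these would be estimated from data.
    print(quartet_power([0.1] * 200, t_internode=1.0, t_terminal=0.5))
```

In this sketch the per-site rates and the state-space size `k` play the role of the estimates mentioned in the abstract. Shortening `t_internode` or lengthening `t_terminal` should move probability mass from signal toward polytomy and noise.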
