The Impact of Outgroup Choice and Missing Data on Major Seed Plant Phylogenetics Using Genome-Wide EST Data

Background: Genome-level analyses have enhanced our view of phylogenetics in many areas of the tree of life. With the production of whole-genome DNA sequences for hundreds of organisms and of large-scale EST databases, a large number of candidate genes for inclusion in phylogenetic analysis have become available. In this work, we exploit the burgeoning genomic data being generated for plant genomes to address one of the more important questions in plant phylogenetics: the hierarchical relationships of the major seed plant lineages (angiosperms, Cycadales, Ginkgoales, Gnetales, and Coniferales). This question remains a work in progress, despite numerous studies using single-gene, few-gene, multi-gene and morphological datasets. Although most recent studies support the notion that gymnosperms and angiosperms are each monophyletic and are sister groups, they differ on the topological arrangements within each major group.

Methodology: We exploited the EST database to construct a supermatrix of DNA sequences (over 1,200 concatenated orthologous gene partitions for 17 taxa) to examine non-flowering seed plant relationships. This analysis employed programs that offer rapid and robust orthology determination of novel, short sequences from plant ESTs based on reference seed plant genomes. Our phylogenetic analysis retrieved a well-resolved and highly supported phylogenetic hypothesis that was unbiased with respect to gene choice and robust to various outgroup combinations.

Conclusions: We evaluated character support and the relative contribution of numerous variables (e.g. gene number, missing data, partitioning schemes, taxon sampling and outgroup choice) to tree topology, stability and support metrics. Our results indicate that, while missing characters and the order in which genes are added to an analysis do not influence branch support, inadequate taxon sampling and a limited choice of outgroup(s) can lead to spurious inference of phylogeny when dealing with phylogenomic-scale data sets. As expected, support and resolution increase significantly as more informative characters are added, until a threshold is reached beyond which support metrics stabilize and the effect of adding conflicting characters is minimized.
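To make the supermatrix idea concrete, the following is a minimal sketch of how per-gene alignments might be concatenated into a single matrix, with taxa that lack a gene padded by missing-data characters. It is an illustration only and not the pipeline used in the study; the function names (`build_supermatrix`, `missing_fraction`), the `?` missing-data symbol, and the toy sequences are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# One gene partition: taxon name -> aligned sequence (hypothetical toy format).
Alignment = Dict[str, str]


def build_supermatrix(
    partitions: List[Alignment], taxa: List[str], missing: str = "?"
) -> Tuple[Dict[str, str], List[Tuple[int, int]]]:
    """Concatenate gene partitions; pad taxa absent from a gene with `missing`."""
    rows: Dict[str, List[str]] = {t: [] for t in taxa}
    bounds: List[Tuple[int, int]] = []  # half-open column range of each partition
    start = 0
    for aln in partitions:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a partition must have equal length")
        width = lengths.pop()
        for t in taxa:
            # A taxon with no ortholog for this gene gets a block of missing data.
            rows[t].append(aln.get(t, missing * width))
        bounds.append((start, start + width))
        start += width
    matrix = {t: "".join(parts) for t, parts in rows.items()}
    return matrix, bounds


def missing_fraction(matrix: Dict[str, str], missing: str = "?") -> float:
    """Fraction of all cells in the matrix that are coded as missing."""
    total = sum(len(seq) for seq in matrix.values())
    absent = sum(seq.count(missing) for seq in matrix.values())
    return absent / total if total else 0.0


# Toy example: two genes, three taxa; Pinus has no sequence for the second gene.
taxa = ["Ginkgo", "Pinus", "Oryza"]
genes = [
    {"Ginkgo": "ACGT", "Pinus": "ACGA", "Oryza": "ACTT"},
    {"Ginkgo": "GG", "Oryza": "GC"},
]
matrix, bounds = build_supermatrix(genes, taxa)
print(matrix["Pinus"])           # ACGA??
print(bounds)                    # [(0, 4), (4, 6)]
print(missing_fraction(matrix))
```

Recording the partition boundaries alongside the matrix is what would allow a later analysis to apply per-gene partitioning schemes or to add and remove genes one at a time, as the abstract describes.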
