Some Limitations of Public Sequence Data for Phylogenetic Inference (in Plants)

The GenBank database contains essentially all of the nucleotide sequence data generated for published molecular systematic studies, but for the majority of taxa these data remain sparse. GenBank has value for phylogenetic approaches that leverage data mining and rapidly improving computational methods, but the limits imposed by the sparse structure of the data are not well understood. Here we present a tree representing 13,093 land plant genera (an estimated 80% of extant plant diversity) to illustrate the potential of public sequence data for broad phylogenetic inference in plants, and we explore the limits to inference imposed by the structure of these data using theoretical foundations from phylogenetic data decisiveness. We find that despite very high levels of missing data (over 96%), the present data retain the potential to inform over 86.3% of all possible phylogenetic relationships. Most of these relationships, however, are informed by small amounts of data: approximately half are informed by fewer than four loci, and more than 99% are informed by fewer than fifteen. We also apply an information-theoretic measure of branch support to assess the strength of phylogenetic signal in the data, revealing many poorly supported branches concentrated near the tips of the tree, where data are sparse and the limiting effects of this sparseness are stronger. We argue that limits to phylogenetic inference and signal imposed by low data coverage may pose significant challenges for comprehensive phylogenetic inference at the species level. Computational requirements provide additional limits for large reconstructions, but these may be overcome by methodological advances, whereas insufficient data coverage can only be remedied by additional sampling effort. We conclude that public databases have exceptional value for modern systematics and evolutionary biology, and that a continued emphasis on expanding taxonomic and genomic coverage will play a critical role in developing these resources to their full potential.
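The relationship between missing data and the number of loci informing each relationship can be illustrated with a small sketch. This is not the study's method: it uses a simplified proxy, assuming a relationship counts as "informed" when at least one locus samples all four taxa of a quartet, and the actual decisiveness calculation may differ. The function names and the toy coverage pattern are hypothetical.

```python
from itertools import combinations


def missing_fraction(coverage, taxa):
    """Fraction of empty cells in the taxon-by-locus presence/absence matrix."""
    total_cells = len(taxa) * len(coverage)
    filled_cells = sum(len(sampled & taxa) for sampled in coverage.values())
    return 1 - filled_cells / total_cells


def quartet_support_counts(coverage, taxa):
    """Map each taxon quartet to the number of loci sampling all four taxa.

    Assumption: quartet co-sampling is used as a simple stand-in for
    "a relationship is informed by a locus".
    """
    counts = {}
    for quartet in combinations(sorted(taxa), 4):
        members = set(quartet)
        counts[quartet] = sum(
            1 for sampled in coverage.values() if members <= sampled
        )
    return counts


def informed_fraction(counts, min_loci=1):
    """Fraction of quartets sampled together by at least `min_loci` loci."""
    return sum(1 for n in counts.values() if n >= min_loci) / len(counts)


# Hypothetical toy coverage pattern: locus name -> set of sampled taxa.
taxa = {"A", "B", "C", "D", "E"}
coverage = {
    "locus1": {"A", "B", "C", "D"},
    "locus2": {"B", "C", "D", "E"},
    "locus3": {"A", "B"},
}

counts = quartet_support_counts(coverage, taxa)
print(missing_fraction(coverage, taxa))        # share of empty matrix cells
print(informed_fraction(counts))               # quartets covered by >= 1 locus
print(informed_fraction(counts, min_loci=2))   # quartets covered by >= 2 loci
```

Enumerating all quartets is only feasible for small examples, since their number grows with the fourth power of the number of taxa. A matrix on the scale described here would need a sampling or tree-guided approach instead.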
