A general species delimitation method with applications to phylogenetic placements

Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches rely either on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well to large datasets, but their results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets.

Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP requires neither an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales to large datasets because it relies on the parallel implementations of the EPA and RAxML, which makes it possible to delimit species in high-throughput sequencing data.

Availability and implementation: The code is freely available at www.exelixis-lab.org/software.html.

Contact: Alexandros.Stamatakis@h-its.org

Supplementary information: Supplementary data are available at Bioinformatics online.
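To illustrate the threshold sensitivity of OTU-picking noted in the Motivation, the following is a minimal toy sketch of greedy, seed-based clustering on pre-aligned sequences. It is not the algorithm of any specific OTU-picking tool; the function names `identity` and `greedy_otus`, the toy sequences and the thresholds are all hypothetical and chosen only for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: list[str], threshold: float) -> list[list[str]]:
    """Put each sequence into the first OTU whose seed it matches at >= threshold.

    The seed of an OTU is its first member; a sequence that matches no
    existing seed starts a new OTU. (Illustrative assumption, not a real tool.)
    """
    otus: list[list[str]] = []
    for s in seqs:
        for otu in otus:
            if identity(s, otu[0]) >= threshold:
                otu.append(s)
                break
        else:
            otus.append([s])
    return otus


# Hypothetical toy data: four aligned 10-bp sequences with increasing divergence.
toy = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "AAAAAATTTT"]

# Number of OTUs recovered from the same data under different thresholds.
counts = {t: len(greedy_otus(toy, t)) for t in (0.95, 0.85, 0.75, 0.55)}
print(counts)
```

On this toy input the number of inferred OTUs changes with every threshold tried, which is the kind of sensitivity that motivates a threshold-free approach.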
