Quantifying uncertainty of taxonomic placement in DNA barcoding and metabarcoding

Summary

A crucial step in the use of DNA markers for biodiversity surveys is the assignment of Linnaean taxonomies (species, genus, etc.) to sequence reads. This assignment makes available all the information already associated with the taxonomic names. Taxonomic placement of DNA barcoding sequences is inherently probabilistic because DNA sequences contain errors, because there is natural variation among sequences within a species, and because reference databases are incomplete and can contain false annotations. However, most existing bioinformatics methods for taxonomic placement either ignore uncertainty or quantify it using metrics other than probability.

In this paper we evaluate the performance of the recently proposed probabilistic taxonomic placement method PROTAX by applying it both to annotated reference sequence data and to unknown environmental data. Our four case studies include contrasting taxonomic groups (fungi, bacteria, mammals and insects), variation in the length and quality of the barcoding sequences (from individually Sanger-sequenced sequences to short Illumina reads), variation in the structures and sizes of the taxonomies (800–130 000 species) and variation in the completeness of the reference databases (representing 15–100% of known species).

Our results demonstrate that PROTAX yields essentially unbiased probabilities of taxonomic placement, which means that its quantification of species identification uncertainty is reliable. As expected, the accuracy of taxonomic placement increases with increasing coverage of the taxonomic and reference sequence databases, and with an increasing ratio of genetic variation among taxonomic levels to genetic variation within taxonomic levels.

We conclude that reliable species-level identification from environmental samples is still challenging and that neglecting identification uncertainty can lead to spurious inference. A key aim for future research is the completion of taxonomic and reference sequence databases and making these two types of data compatible.
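The claim that placement probabilities are "essentially unbiased" can be read as a calibration statement: among assignments given probability near p, a fraction of about p should be correct. The sketch below illustrates one generic way to check this on sequences with known identity. It is not the PROTAX implementation or the authors' evaluation procedure; the function names, the equal-width binning, and the summary statistic are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

# Hypothetical helper names; not part of PROTAX or any published tool.
def reliability_table(
    probs: Sequence[float], correct: Sequence[bool], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed fraction correct,
    count) tuple per non-empty bin.
    """
    if len(probs) != len(correct):
        raise ValueError("probs and correct must have the same length")
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, c in zip(probs, correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b] += p
        hits[b] += 1 if c else 0
        counts[b] += 1
    return [
        (sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


def expected_calibration_error(table: List[Tuple[float, float, int]]) -> float:
    """Count-weighted mean gap between predicted and observed accuracy."""
    total = sum(n for _, _, n in table)
    return sum(n * abs(p - a) for p, a, n in table) / total


if __name__ == "__main__":
    # Toy data (made up): ten assignments, each reported with probability 0.9.
    probs = [0.9] * 10
    well_calibrated = [True] * 9 + [False]       # 9 of 10 correct
    overconfident = [True] * 5 + [False] * 5     # only 5 of 10 correct
    for label, outcome in [("calibrated", well_calibrated),
                           ("overconfident", overconfident)]:
        table = reliability_table(probs, outcome)
        print(label, table, expected_calibration_error(table))
```

In this toy setting, a method whose 0.9-probability assignments are right nine times out of ten shows a gap near zero, whereas one that is right only half the time shows a gap near 0.4. With real data, each bin needs enough sequences for the observed fraction to be meaningful.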
