Automated high throughput animal CO1 metabarcode classification

We introduce a method for assigning names to CO1 metabarcode sequences with confidence scores in a rapid, high-throughput manner. We compiled nearly 1 million CO1 barcode sequences appropriate for classifying arthropods and chordates. Compared to our previous Insecta classifier, the current classifier has more than three times the taxonomic coverage, including outgroups, and is based on almost five times as many reference sequences. We show that, unlike popular rDNA metabarcoding markers, classification performance for CO1 is similar across the length of the barcoding region. We also show that the RDP classifier makes taxonomic assignments about 19 times faster than the popular top BLAST hit method and reduces the false positive rate from nearly 100% to 34%. This is especially important in large-scale biodiversity and biomonitoring studies, where datasets can become very large and the taxonomic assignment problem is not trivial. Finally, we show that reference databases are becoming more representative of current species diversity but that gaps still exist. We suggest that it would benefit the field as a whole if all investigators involved in metabarcoding studies, through collaborations with taxonomic experts, also planned to barcode representatives of their local biota as a part of their projects.
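To make the idea of a name assigned "with a confidence score" concrete, the following is a minimal Python sketch of a naive Bayesian k-mer classifier with bootstrap support. It is an illustration only and not the RDP classifier's implementation: the word size of 8, the fixed 0.5 pseudocount, the 100 bootstrap replicates, and all function names are assumptions made for this example.

```python
import math
import random
from collections import defaultdict

K = 8  # word (k-mer) size; an assumption for this sketch


def kmers(seq, k=K):
    """Return the set of overlapping k-mers found in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references, k=K):
    """Count, per taxon, how many reference sequences contain each word.

    references: iterable of (taxon_name, sequence) pairs (hypothetical format).
    """
    word_counts = defaultdict(lambda: defaultdict(int))
    n_seqs = defaultdict(int)
    for taxon, seq in references:
        n_seqs[taxon] += 1
        for word in kmers(seq, k):
            word_counts[taxon][word] += 1
    return word_counts, n_seqs


def log_score(words, taxon, word_counts, n_seqs):
    """Sum of log P(word | taxon), using a simplified fixed pseudocount."""
    total = n_seqs[taxon]
    counts = word_counts[taxon]
    return sum(math.log((counts.get(w, 0) + 0.5) / (total + 1)) for w in words)


def classify(query, word_counts, n_seqs, k=K, n_boot=100, seed=0):
    """Return (best_taxon, confidence), where confidence is the fraction of
    bootstrap replicates that agree with the full-sequence assignment."""
    rng = random.Random(seed)
    words = sorted(kmers(query, k))
    if not words:
        raise ValueError("query is shorter than the word size")
    taxa = sorted(n_seqs)

    def best(word_subset):
        return max(taxa, key=lambda t: log_score(word_subset, t, word_counts, n_seqs))

    assigned = best(words)
    # Each replicate resamples a small subset of the query's words.
    subset_size = max(1, len(words) // k)
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(words) for _ in range(subset_size)]
        if best(sample) == assigned:
            hits += 1
    return assigned, hits / n_boot
```

In this sketch, `train` would be called once on the labelled reference sequences and `classify` once per query; assignments whose confidence falls below a chosen cutoff could then be left unassigned at that rank rather than reported as a possible false positive.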
