metabaR: an R package for the evaluation and improvement of DNA metabarcoding data quality

DNA metabarcoding is becoming the tool of choice for biodiversity studies across taxa and large-scale environmental gradients. Yet the artefacts present in metabarcoding datasets often preclude a proper interpretation of ecological patterns. Bioinformatic pipelines that remove experimental noise have been designed to address this issue. However, these often target only some of the artefacts produced, or are marker-specific. In addition, assessments of data curation quality and of the appropriateness of filtering thresholds are seldom available in existing pipelines, partly due to the lack of appropriate visualisation tools. Here, we present metabaR, an R package that provides a comprehensive suite of tools to effectively curate DNA metabarcoding data after basic bioinformatic analyses. In particular, metabaR uses experimental negative or positive controls to identify different types of artefactual sequences, i.e. reagent contaminants and tag-jumps. It also flags potentially dysfunctional PCRs based on PCR replicate similarities when replicates are available. Finally, metabaR provides tools to visualise DNA metabarcoding data characteristics in their experimental context as well as their distribution, and to facilitate assessment of the appropriateness of data curation filtering thresholds. metabaR is applicable to any DNA metabarcoding experimental design but is most powerful when the design includes experimental controls and replicates. More generally, the simplicity and flexibility of the package make it applicable to any DNA marker, to data generated with any sequencing platform, and to data pre-analysed with any bioinformatic pipeline. Its outputs are easily usable for downstream analyses with any ecological R package. metabaR complements existing bioinformatics pipelines by providing scientists with a variety of functions with customisable methods that allow the user to effectively clean DNA metabarcoding data and avoid serious misinterpretations. It thus offers a promising platform for automated quality assessment of DNA metabarcoding data for environmental research and biomonitoring.
