Denoising the Denoisers: an independent evaluation of microbiome sequence error-correction approaches

High-depth sequencing of universal marker genes such as the 16S rRNA gene is a common strategy for profiling microbial communities. Traditionally, sequence reads are clustered into operational taxonomic units (OTUs) at a defined identity threshold so that sequencing errors do not generate spurious taxonomic units. Recently, however, numerous bioinformatic packages have been released that attempt to correct sequencing errors and recover real biological sequences at single-nucleotide resolution by generating amplicon sequence variants (ASVs). As more researchers begin to use high-resolution ASVs, there is a need for an in-depth and unbiased comparison of these novel “denoising” pipelines. In this study, we conduct a thorough comparison of three of the most widely used denoising packages (DADA2, UNOISE3, and Deblur), as well as an open-reference 97% OTU clustering pipeline, on mock, soil, and host-associated communities. The mock community analyses showed that, although the approaches produced similar microbial compositions based on relative abundance, they identified vastly different numbers of ASVs, which significantly affected alpha-diversity metrics. Our analysis of real datasets using the recommended settings for each denoising pipeline also showed that the three packages were consistent in their per-sample compositions, with only minor differences based on weighted UniFrac and Bray–Curtis dissimilarity. DADA2 tended to find more ASVs than the other two denoising pipelines in both the real soil data and two host-associated datasets, suggesting that it could be better at finding rare organisms, but at the expense of possible false positives. The open-reference OTU clustering approach identified considerably more OTUs than the denoising pipelines identified ASVs in all datasets tested. The three denoising approaches differed significantly in run time, with UNOISE3 running more than 1,200 and 15 times faster than DADA2 and Deblur, respectively. Our findings indicate that, although all pipelines result in similar general community structure, the number of ASVs/OTUs and the resulting alpha-diversity metrics vary considerably, and this variation should be considered when attempting to distinguish rare organisms from possible background noise.
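The contrast between stable composition and variable alpha diversity can be illustrated with a minimal Python sketch. It is not the analysis code used in the study: the count tables `pipeline_a` and `pipeline_b`, the feature names, and the helper functions are all hypothetical, and the sketch assumes that both pipelines report the same identifiers for shared sequences. The second table differs from the first only by ten singleton features, which leaves Bray–Curtis dissimilarity on relative abundances near zero while observed richness increases severalfold.

```python
import math


def relative_abundance(counts):
    """Convert raw read counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def observed_richness(counts):
    """Alpha diversity as the number of features with at least one read."""
    return sum(1 for n in counts.values() if n > 0)


def shannon_index(counts):
    """Shannon diversity (natural log) computed on relative abundances."""
    rel = relative_abundance(counts)
    return -sum(p * math.log(p) for p in rel.values() if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two relative-abundance profiles."""
    rel_a = relative_abundance(counts_a)
    rel_b = relative_abundance(counts_b)
    features = set(rel_a) | set(rel_b)
    diff = sum(abs(rel_a.get(f, 0.0) - rel_b.get(f, 0.0)) for f in features)
    total = sum(rel_a.get(f, 0.0) + rel_b.get(f, 0.0) for f in features)
    return diff / total


# Hypothetical output of two pipelines for the same sample (toy values).
pipeline_a = {"ASV1": 500, "ASV2": 300, "ASV3": 200}
pipeline_b = {"ASV1": 495, "ASV2": 297, "ASV3": 198}
# Ten extra singleton features that only the second pipeline reports.
pipeline_b.update({f"rare{i}": 1 for i in range(1, 11)})

richness_a = observed_richness(pipeline_a)
richness_b = observed_richness(pipeline_b)
shannon_a = shannon_index(pipeline_a)
shannon_b = shannon_index(pipeline_b)
dissimilarity = bray_curtis(pipeline_a, pipeline_b)

print(f"richness: {richness_a} vs {richness_b}")
print(f"Shannon: {shannon_a:.3f} vs {shannon_b:.3f}")
print(f"Bray-Curtis: {dissimilarity:.3f}")
```

In this toy case the rare features account for only 1% of reads, so abundance-weighted beta-diversity measures barely register them, whereas richness counts each one as a full feature regardless of whether it is a real organism or residual noise.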
