Performance of Microbiome Sequence Inference Methods in Environments with Varying Biomass

ABSTRACT Microbiome community composition plays an important role in human health, and while most research to date has focused on high-microbial-biomass communities, low-biomass communities are also important. However, contamination and technical noise make determining the true community signal difficult when biomass levels are low, and the influence of varying biomass on sequence processing methods has received little attention. Here, we benchmarked six methods that infer community composition from 16S rRNA sequence reads, using samples of varying biomass. We included two operational taxonomic unit (OTU) clustering algorithms, one entropy-based method, and three more-recent amplicon sequence variant (ASV) methods. We first compared inference results from high-biomass mock communities to assess baseline performance. We then benchmarked the methods on a dilution series made from a single mock community, so that the samples varied only in biomass. ASVs/OTUs inferred by each method were classified as representing expected community, technical noise, or contamination. With the high-biomass data, we found that the ASV methods had good sensitivity and precision, whereas the other methods suffered in one area or in both. Inferred contamination was present only in small proportions. With the dilution series, contamination represented an increasing proportion of the data from the inferred communities, regardless of the inference method used. However, correlation between inferred contaminants and sample biomass was strongest for the ASV methods and weakest for the OTU methods. Thus, no inference method on its own can distinguish true community sequences from contaminant sequences, but ASV methods provide the most accurate characterization of community and contaminants.

IMPORTANCE Microbial communities have important ramifications for human health, but determining their impact requires accurate characterization. Current technology makes microbiome sequence data more accessible than ever. However, popular software methods for analyzing these data are based on algorithms developed alongside older sequencing technology and smaller data sets and thus may not be adequate for modern, high-throughput data sets. Additionally, samples from environments where microbes are scarce present additional challenges to community characterization relative to high-biomass environments, an issue that is often ignored. We found that a new class of microbiome sequence processing tools, called amplicon sequence variant (ASV) methods, outperformed conventional methods. In samples representing low-biomass communities, where sample contamination becomes a significant confounding factor, the improved accuracy of ASV methods may allow more-robust computational identification of contaminants.
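
The evaluation summarized in the abstract (labeling each inferred ASV/OTU as expected community, technical noise, or contamination, then reporting sensitivity and precision) can be illustrated with a minimal sketch. It is not the authors' pipeline. The function names, the one-mismatch threshold for "noise", the toy sequences, and the read counts are all assumptions made for illustration.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Count mismatched positions; sequences of unequal length are treated as fully different."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def classify(seq: str, reference: List[str], noise_max_mismatches: int = 1) -> str:
    """Label one inferred sequence against the mock-community reference.

    Assumed rule (hypothetical, not taken from the paper):
    exact match -> expected; within the mismatch threshold -> noise;
    anything else -> contaminant.
    """
    best = min(hamming(seq, ref) for ref in reference)
    if best == 0:
        return "expected"
    if best <= noise_max_mismatches:
        return "noise"
    return "contaminant"


def evaluate(inferred: Dict[str, int], reference: List[str]) -> Dict[str, float]:
    """Summarize one sample; `inferred` maps each inferred sequence to its read count."""
    if not inferred or not reference:
        raise ValueError("inferred and reference must both be non-empty")
    labels = {seq: classify(seq, reference) for seq in inferred}
    found = {seq for seq, label in labels.items() if label == "expected"}
    total_reads = sum(inferred.values())
    contaminant_reads = sum(
        count for seq, count in inferred.items() if labels[seq] == "contaminant"
    )
    return {
        # Fraction of reference sequences that were recovered exactly.
        "sensitivity": len(found) / len(set(reference)),
        # Fraction of inferred sequences that are true community members.
        "precision": len(found) / len(inferred),
        # Share of reads assigned to sequences labeled as contaminants.
        "contaminant_read_fraction": contaminant_reads / total_reads,
    }


# Toy data (made up): three expected sequences, four inferred ones.
reference = ["ACGTACGT", "TTGGCCAA", "GGGGAAAA"]
inferred = {"ACGTACGT": 900, "TTGGCCAA": 50, "ACGTACGA": 10, "CCCCCCCC": 40}
metrics = evaluate(inferred, reference)
print(metrics)
```

Under these assumptions, computing `contaminant_read_fraction` for each sample of a dilution series and correlating it with the dilution factor would give a rough analogue of the contaminant-versus-biomass comparison described above.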
