Evaluating the accuracy of amplicon-based microbiome computational pipelines on simulated human gut microbial communities

BackgroundMicrobiome studies commonly use 16S rRNA gene amplicon sequencing to characterize microbial communities. Errors introduced at multiple steps in this process can affect the interpretation of the data. Here we evaluate the accuracy of operational taxonomic unit (OTU) generation, taxonomic classification, alpha- and beta-diversity measures for different settings in QIIME, MOTHUR and a pplacer-based classification pipeline, using a novel software package: DECARD.ResultsIn-silico we generated 100 synthetic bacterial communities approximating human stool microbiomes to be used as a gold-standard for evaluating the colligative performance of microbiome analysis software. Our synthetic data closely matched the composition and complexity of actual healthy human stool microbiomes. Genus-level taxonomic classification was correctly done for only 50.4–74.8% of the source organisms. Miscall rates varied from 11.9 to 23.5%. Species-level classification was less successful, (6.9–18.9% correct); miscall rates were comparable to those of genus-level targets (12.5–26.2%). The degree of miscall varied by clade of organism, pipeline and specific settings used. OTU generation accuracy varied by strategy (closed, de novo or subsampling), reference database, algorithm and software implementation. Shannon diversity estimation accuracy correlated generally with OTU-generation accuracy. Beta-diversity estimates with Double Principle Coordinate Analysis (DPCoA) were more robust against errors introduced in processing than Weighted UniFrac. The settings suggested in the tutorials were among the worst performing in all outcomes tested.ConclusionsEven when using the same classification pipeline, the specific OTU-generation strategy, reference database and downstream analysis methods selection can have a dramatic effect on the accuracy of taxonomic classification, and alpha- and beta-diversity estimation. Even minor changes in settings adversely affected the accuracy of the results, bringing them far from the best-observed result. Thus, specific details of how a pipeline is used (including OTU generation strategy, reference sets, clustering algorithm and specific software implementation) should be specified in the methods section of all microbiome studies. Researchers should evaluate their chosen pipeline and settings to confirm it can adequately answer the research question rather than assuming the tutorial or standard-operating-procedure settings will be adequate or optimal.

[1]  John G Kenny,et al.  A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling , 2016, BMC Genomics.

[2]  R. Knight,et al.  UniFrac: a New Phylogenetic Method for Comparing Microbial Communities , 2005, Applied and Environmental Microbiology.

[3]  J. Bunge,et al.  Polymerase chain reaction primers miss half of rRNA microbial diversity , 2009, The ISME Journal.

[4]  Leping Li,et al.  ART: a next-generation sequencing read simulator , 2012, Bioinform..

[5]  Jeanne M. Marrazzo,et al.  Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria , 2012, PloS one.

[6]  Jennifer M. Fettweis,et al.  The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies , 2015, BMC Microbiology.

[7]  K. Eric Wommack,et al.  Groundtruthing Next-Gen Sequencing for Microbial Ecology–Biases and Errors in Community Structure Estimates from PCR Amplicon Pyrosequencing , 2012, PloS one.

[8]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[9]  Rob Knight,et al.  PyNAST: a flexible tool for aligning sequences to a template alignment , 2009, Bioinform..

[10]  Frédéric Mahé,et al.  Swarm: robust and fast clustering method for amplicon-based studies , 2014, PeerJ.

[11]  Patrick D. Schloss,et al.  Assessing and Improving Methods Used in Operational Taxonomic Unit-Based Approaches for 16S rRNA Gene Sequence Analysis , 2011, Applied and Environmental Microbiology.

[12]  F. Bushman,et al.  QIIME allows integration and analysis of high-throughput community sequencing data. Nat. Meth. , 2010 .

[13]  Dan Knights,et al.  Systematic improvement of amplicon marker gene methods for increased accuracy in microbiome studies , 2016, Nature Biotechnology.

[14]  Rob Knight,et al.  Secondary structure information does not improve OTU assignment for partial 16s rRNA sequences , 2012, The ISME Journal.

[15]  Ross A. Overbeek,et al.  The RDP (Ribosomal Database Project) , 1997, Nucleic Acids Res..

[16]  Hans Bisgaard,et al.  Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies , 2016, Microbiome.

[17]  E. Virginia Armbrust,et al.  pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree , 2010, BMC Bioinformatics.

[18]  Martin Hartmann,et al.  Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities , 2009, Applied and Environmental Microbiology.

[19]  Rob Knight,et al.  UCHIME improves sensitivity and speed of chimera detection , 2011, Bioinform..

[20]  Pelin Yilmaz,et al.  Phylogeny-aware identification and correction of taxonomically mislabeled sequences , 2016, bioRxiv.

[21]  B L Maidak,et al.  The RDP-II (Ribosomal Database Project) , 2001, Nucleic Acids Res..

[22]  Antonio Gonzalez,et al.  Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences , 2014, PeerJ.

[23]  A. Klindworth,et al.  Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies , 2012, Nucleic acids research.

[24]  Mihai Pop,et al.  Alignment and clustering of phylogenetic markers - implications for microbial diversity studies , 2010, BMC Bioinformatics.

[25]  Pelin Yilmaz,et al.  The SILVA ribosomal RNA gene database project: improved data processing and web-based tools , 2012, Nucleic Acids Res..

[26]  Nicholas A. Bokulich,et al.  mockrobiota: a Public Resource for Microbiome Bioinformatics Benchmarking , 2016, mSystems.

[27]  Christian von Mering,et al.  Limits to robustness and reproducibility in the demarcation of operational taxonomic units. , 2015, Environmental microbiology.

[28]  Robert C. Edgar,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2001 .

[29]  William A. Walters,et al.  QIIME allows analysis of high-throughput community sequencing data , 2010, Nature Methods.

[30]  Anne Chao,et al.  Unifying Species Diversity, Phylogenetic Diversity, Functional Diversity, and Related Similarity and Differentiation Measures Through Hill Numbers , 2014 .

[31]  Susan Holmes,et al.  phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data , 2013, PloS one.

[32]  D. Chessel,et al.  From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. , 2004, Journal of theoretical biology.

[33]  Eoin L. Brodie,et al.  Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB , 2006, Applied and Environmental Microbiology.

[34]  Patrick D. Schloss,et al.  Reducing the Effects of PCR Amplification and Sequencing Artifacts on 16S rRNA-Based Studies , 2011, PloS one.

[35]  Patrick D. Schloss,et al.  The Effects of Alignment Quality, Distance Calculation Method, Sequence Filtering, and Region on the Analysis of 16S rRNA Gene-Based Studies , 2010, PLoS Comput. Biol..