NG-Tax 2.0: A Semantic Framework for High-Throughput Amplicon Analysis

NG-Tax 2.0 is a semantic framework for FAIR high-throughput analysis and classification of marker gene amplicon sequences, including bacterial and archaeal 16S ribosomal RNA (rRNA), eukaryotic 18S rRNA and ribosomal intergenic transcribed spacer sequences. It can directly use single or merged reads, paired-end reads and unmerged paired-end reads from long range fragments as input to generate de novo amplicon sequence variants (ASVs). Using the RDF data model, ASVs can be automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance, thereby achieving the level of interoperability required to utilize such data to its full potential. The graph database can be queried directly, allowing for comparative analyses across thousands of samples, and is connected with an interactive Rshiny toolbox for analysis and visualization of (meta)data. Additionally, NG-Tax 2.0 exports an extended BIOM 1.0 (JSON) file as a starting point for further analyses by other means. The extended BIOM file is backwards compatible and contains new attribute types that record the command arguments used, the sequences of the ASVs formed and the classification confidence scores. The performance of NG-Tax 2.0 was compared with DADA2, using the DADA2 plugin in the QIIME 2 analysis pipeline. Fourteen 16S rRNA gene amplicon mock community samples were obtained from the literature and evaluated. Precision of NG-Tax 2.0 was significantly higher, with an average of 0.95 vs 0.58 for QIIME2-DADA2, while recall was comparable, with averages of 0.85 and 0.77, respectively. NG-Tax 2.0 is written in Java. The code, the ontology, a Galaxy platform implementation, the analysis toolbox, tutorials and example SPARQL queries are freely available at http://wurssb.gitlab.io/ngtax under the MIT License.
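For readers unfamiliar with the precision and recall figures quoted above, the following is a minimal sketch of how such scores can be computed for a single mock community sample. It assumes the standard set-based definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)) applied to taxon labels. The abstract does not state the exact evaluation protocol or taxonomic level used, so the function name `precision_recall` and the example taxa are hypothetical and purely illustrative.

```python
def precision_recall(expected, observed):
    """Return (precision, recall) for one mock community sample.

    expected: taxa known to be present in the mock community
    observed: taxa reported by the pipeline under evaluation
    Assumes standard set-based definitions; this is not taken from the paper.
    """
    expected, observed = set(expected), set(observed)
    true_pos = len(expected & observed)    # correctly detected taxa
    false_pos = len(observed - expected)   # reported but not in the mock
    false_neg = len(expected - observed)   # in the mock but missed
    precision = true_pos / (true_pos + false_pos) if observed else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    return precision, recall


# Hypothetical example: four expected genera, one missed, one spurious.
expected_taxa = {"Bacteroides", "Escherichia", "Lactobacillus", "Clostridium"}
observed_taxa = {"Bacteroides", "Escherichia", "Lactobacillus", "Pseudomonas"}
precision, recall = precision_recall(expected_taxa, observed_taxa)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these definitions, a pipeline that reports few spurious taxa scores high on precision, while one that recovers most expected taxa scores high on recall. Averaging the per-sample values over all mock samples would give summary figures of the kind reported above.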
