IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences

Background: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of “over classification” is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive.

Results: Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats.

Conclusions: IDTAXA’s classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences.
IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online (http://DECIPHER.codes).
