Taxonomy based performance metrics for evaluating taxonomic assignment methods

Background: Metagenomics experiments often make inferences about microbial communities by sequencing 16S and 18S rRNA, and taxonomic assignment is a fundamental step in such studies. This paper addresses weaknesses in two types of metrics that previous studies commonly used to measure the performance of taxonomic assignment methods: sequence-count-based metrics and binary error measurement. These metrics make performance evaluation results biased, less informative, and mutually incomparable.

Results: We investigated the weaknesses of these two types of metrics and proposed new performance metrics, including Average Taxonomy Distance (ATD) and ATD_by_Taxa, together with a visualization, the ATD plot.

Conclusions: Comparing the evaluation results of four popular taxonomic assignment methods across three test data sets, we found the new metrics to be more robust, informative, and comparable.
