RefSeq database growth influences the accuracy of k-mer-based species identification

Accurate species-level taxonomic classification and profiling of complex microbial communities remain a challenge, owing to homologous regions shared among closely related species and to the sparse representation of non-human-associated microbes in reference databases. Although the database undoubtedly has a strong influence on the sensitivity of taxonomic classifiers and profilers, no study to date has used historical RefSeq releases to carefully examine its impact on accuracy. In this study, we examined the influence of the database, over time, on k-mer-based sequence classification and profiling. We present three major findings: (i) database growth over time resulted in more classified reads, but fewer species-level classifications and more species-level misclassifications; (ii) Bayesian re-estimation of abundance helped to recover species-level classifications when the exact target strain was present; and (iii) Bayesian re-estimation struggled when the database lacked the target strain, resulting in a notable decrease in accuracy. In summary, our findings suggest that the growth of RefSeq over time has strongly influenced the accuracy of k-mer-based classification and profiling methods, so that classification results differ depending on the particular database used. These results suggest a need for new algorithms specially adapted for large genome collections and for better measures of classification uncertainty.
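
Finding (i) can be illustrated with a toy example. The sketch below is not the algorithm of any tool evaluated in the study; it assumes a deliberately simplified rule in which a read is assigned to the lowest common ancestor of every species that shares at least one k-mer with it. The sequences, the names `species_A`, `species_B`, and `genus_X`, and the functions `build_index` and `classify` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k):
    """Map each k-mer to the set of species whose genome contains it."""
    index = defaultdict(set)
    for species, seq in genomes.items():
        for km in kmers(seq, k):
            index[km].add(species)
    return index


def classify(read, index, genus_of, k):
    """Simplified rule (an assumption of this sketch): assign the read to
    the lowest common ancestor of all species hit by any of its k-mers."""
    species = set()
    for km in kmers(read, k):
        species |= index.get(km, set())
    if not species:
        return ("unclassified", None)
    if len(species) == 1:
        return ("species", next(iter(species)))
    genera = {genus_of[s] for s in species}
    if len(genera) == 1:
        return ("genus", next(iter(genera)))
    return ("root", None)


K = 5
GENUS_OF = {"species_A": "genus_X", "species_B": "genus_X"}

# Made-up toy genomes: the first 10 bases are shared by both species.
old_db = {"species_A": "ACGTACGTTGCAAGGCTTAA"}
new_db = dict(old_db, species_B="ACGTACGTTGCCTTGGAACC")

old_index = build_index(old_db, K)
new_index = build_index(new_db, K)

shared_read = "ACGTACGTTG"  # drawn from the region common to both genomes
novel_read = "CCTTGGAACC"   # drawn from the region unique to species_B

results = {
    "shared_old": classify(shared_read, old_index, GENUS_OF, K),
    "shared_new": classify(shared_read, new_index, GENUS_OF, K),
    "novel_old": classify(novel_read, old_index, GENUS_OF, K),
    "novel_new": classify(novel_read, new_index, GENUS_OF, K),
}
```

Under these assumptions, adding `species_B` lets the previously unclassified read receive a label, while the read from the shared region drops from a species-level to a genus-level assignment, which mirrors the trade-off described in the abstract.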
