Taxonomic Classification at the Strain Level using a Species-of-Interest $\boldsymbol{k}$-mer Database

Metagenomic shotgun sequencing of microbial environments provides researchers with large volumes of genetic data that can elucidate the complex profile of microbial species in a microbiome. However, classifying these sequences at the strain level is computationally challenging because strains of the same species share a high degree of genetic similarity. Furthermore, simply matching a read to a higher taxonomic rank (phylum, genus, species) may provide little to no clinical insight. We introduce an algorithm that classifies metagenomic reads at the strain level using exact matches of $\boldsymbol{k}$-mers against a database of species- and strain-level $\boldsymbol{k}$-mers. Comparison of our method with the state-of-the-art classifier Kraken on simulated reads shows a significant improvement in strain-level sensitivity and precision in the presence of sequencing errors.
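The idea of exact $\boldsymbol{k}$-mer matching against a strain-level database can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_database`, `classify_read`), the toy genomes, the choice of `k = 5`, and the voting rule (count only $\boldsymbol{k}$-mers found in exactly one strain, and return no call on a tie) are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_database(strain_genomes, k):
    """Map each k-mer to the set of strains whose genome contains it."""
    db = {}
    for strain, genome in strain_genomes.items():
        for kmer in kmers(genome, k):
            db.setdefault(kmer, set()).add(strain)
    return db


def classify_read(read, db, k):
    """Return the strain with the most strain-unique k-mer hits, or None.

    Hypothetical voting rule: k-mers shared by several strains carry no
    strain-level signal and are skipped; ties yield no call.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        strains = db.get(kmer)  # exact match only, no mismatches allowed
        if strains is not None and len(strains) == 1:
            votes[next(iter(strains))] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous between the two best strains
    return ranked[0][0]


# Toy genomes (assumed data) that differ at a single base near the 3' end.
K = 5
GENOMES = {
    "strain_A": "ACGTACGTTTGACCA",
    "strain_B": "ACGTACGTTTGACGA",
}
DB = build_database(GENOMES, K)

if __name__ == "__main__":
    for read in ("TTGACCA", "TTGACGA", "ACGTACG"):
        print(read, classify_read(read, DB, K))
```

In this toy setting, a read that overlaps the differing base is assigned to one strain, while a read drawn entirely from the shared region receives no strain-level call. That is the situation the abstract describes: most $\boldsymbol{k}$-mers are common to closely related strains, so only the few discriminating ones can separate them.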
