Fast and Sensitive Classification of Short Metagenomic Reads with SKraken

A major problem when analyzing a metagenomic sample is to taxonomically annotate its reads in order to identify the species present and their relative abundances. Many tools have been developed recently; however, they are not always adequate for the increasing database volume. In this paper we propose an efficient method, called SKraken, that combines a taxonomic tree with k-mer frequency counting. SKraken extracts the most representative k-mers for each species and filters out the less representative ones. SKraken is inspired by Kraken, one of the state-of-the-art methods. We compare the performance of SKraken with that of Kraken on both real and synthetic datasets, and SKraken exhibits higher classification precision and faster processing speed. Availability: https://bitbucket.org/marchiori_dev/skraken.
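To make the idea of combining a taxonomic tree with k-mer counting concrete, the following toy sketch may help. It is not the SKraken implementation: the abstract does not define how "representative" is measured, so the score used here (the fraction of species below a k-mer's lowest common ancestor that actually contain the k-mer) is an assumption made for illustration only. The names `build_index`, `classify`, `min_score`, and the tiny `GENOMES` and `PARENT` data are likewise hypothetical, and the read classifier uses a simple majority vote rather than Kraken's own scoring.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def path_to_root(taxon, parent):
    """List the taxon and all its ancestors, from the taxon up to the root."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(taxa, parent):
    """Lowest common ancestor of a non-empty collection of taxa."""
    paths = [path_to_root(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    # The first shared node on a leaf-to-root path is the lowest one.
    return next(t for t in paths[0] if t in common)


def build_index(genomes, parent, k, min_score):
    """Map each retained k-mer to the LCA of the species containing it.

    Assumed score (illustrative only): the share of species under the LCA
    that contain the k-mer. k-mers scoring below min_score are dropped.
    """
    owners = defaultdict(set)
    for species, seq in genomes.items():
        for km in kmers(seq, k):
            owners[km].add(species)

    # Count which species sit below every node of the taxonomy.
    species_under = defaultdict(set)
    for species in genomes:
        for node in path_to_root(species, parent):
            species_under[node].add(species)

    index = {}
    for km, species_set in owners.items():
        node = lca(species_set, parent)
        score = len(species_set) / len(species_under[node])
        if score >= min_score:
            index[km] = node
    return index


def classify(read, index, k):
    """Assign a read to the taxon hit by most of its k-mers, or None."""
    votes = defaultdict(int)
    for i in range(len(read) - k + 1):
        node = index.get(read[i:i + k])
        if node is not None:
            votes[node] += 1
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy data: two genera under one root, three species.
PARENT = {"sp1": "genusA", "sp2": "genusA", "sp3": "genusB",
          "genusA": "root", "genusB": "root"}
GENOMES = {"sp1": "AAACCC", "sp2": "AAAGGG", "sp3": "TTTGGG"}

filtered = build_index(GENOMES, PARENT, k=3, min_score=0.7)
unfiltered = build_index(GENOMES, PARENT, k=3, min_score=0.0)
```

In this toy data the k-mer `GGG` occurs in only two of the three species under the root, so the filtered index drops it while the unfiltered index keeps it at the root. Whether SKraken's actual filter behaves this way is not stated in the abstract; the sketch only shows the general shape of pruning an LCA-based k-mer database.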
