CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers

Background: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it remains computationally challenging because of the size of the datasets that modern sequencing technologies can produce.

Results: We introduce CLARK, a novel approach to classifying metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to that of the best state-of-the-art tools, and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies about 32 million metagenomic short reads per minute with high accuracy. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions.

Conclusions: CLARK is a versatile, fast, and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/.
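The abstract does not spell out the algorithm, so the following is only a minimal sketch of what classification with "discriminative k-mers" (the phrase used in the title) could look like, not CLARK's actual implementation. It assumes that a k-mer counts as discriminative when it occurs in exactly one target sequence, and that a read is assigned to the target whose discriminative k-mers it hits most often. The function names, the toy sequences, and the tie-handling rule are all hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_discriminative_index(targets, k):
    """Map every k-mer that occurs in exactly one target to that target's label.

    `targets` is a dict {label: reference sequence}. K-mers shared by two or
    more targets are dropped (assumed definition of "discriminative").
    """
    owners = defaultdict(set)
    for label, seq in targets.items():
        for km in kmers(seq, k):
            owners[km].add(label)
    return {km: next(iter(labels))
            for km, labels in owners.items() if len(labels) == 1}


def classify(read, index, k):
    """Assign a read to the target with the most discriminative k-mer hits.

    Returns None when the read has no hits or when the top two targets tie
    (an assumed convention for this sketch).
    """
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None
    ranked = hits.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


if __name__ == "__main__":
    # Toy reference sequences; they share the prefix ACGTACG.
    targets = {"A": "ACGTACGTTTGACC", "B": "ACGTACGGGCATTA"}
    index = build_discriminative_index(targets, k=4)
    for read in ("GTTTGAC", "GGGCATT", "ACGTAC"):
        print(read, classify(read, index, k=4))
```

In this toy setup, a read drawn entirely from the shared prefix has no discriminative k-mers and is left unclassified, which illustrates why removing shared k-mers can make the remaining hits unambiguous.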
