CS-SCORE: Rapid identification and removal of human genome contaminants from metagenomic datasets.

Metagenomic sequencing data obtained from host-associated microbial communities are usually contaminated with host genome sequence fragments. These contaminating fragments must be identified and removed before any downstream analysis. Available host-contamination detection techniques have enormous time and memory requirements, which makes processing large metagenomic datasets challenging. This study presents CS-SCORE, a novel algorithm that can rapidly identify host sequences contaminating metagenomic datasets. Validation results indicate that CS-SCORE is 2-6 times faster than the current state-of-the-art methods. Furthermore, the memory footprint of CS-SCORE is in the range of 2-2.5 GB, which is significantly lower than that of other available tools. CS-SCORE achieves this efficiency by incorporating (1) a heuristic pre-filtering mechanism and (2) a directed-mapping approach that utilizes a novel sequence composition metric (cs-score). CS-SCORE is expected to be a handy 'pre-processing' utility for researchers analyzing metagenomic datasets.

AVAILABILITY: For academic users, an implementation of CS-SCORE is freely available at http://metagenomics.atc.tcs.com/cs-score or https://metagenomics.atc.tcs.com/preprocessing/cs-score.
