Accurate annotation of metagenomic data without species-level references

Taxonomic annotation is a critical first step for analysis of metagenomic data. Despite a lot of tools being developed, the accuracy is still not satisfactory, in particular, when a close species-level reference does not exist in the database. In this paper, we propose a novel annotation tool, MetaAnnotator, to annotate metagenomic reads, which outperforms all existing tools significantly when only genus-level references exist in the database. From our experiments, MetaAnnotator can assign 87.5% reads correctly (67.5% reads are assigned to the exact genus) with only 8.5% reads wrongly assigned. The best existing tool (MetaCluster-TA) can only achieve 73.4% correct read assignment (with only 50.9% reads assigned to the exact genus and 22.6% reads wrongly assigned). The speed of MetaAnnotator is also the second faster (1 hour for 20 million reads). The core concepts behind MetaAnnotator includes: (i) we only consider exact k-mers in coding regions of the references as they should be more significant and accurate; (ii) to assign reads to taxonomy nodes, we construct genome and taxonomy specific probabilistic models from the reference database; and (iii) using the BWT data structure to speed up the k-mer matching process.

[1]  Pradeep Ravikumar,et al.  QUIC: quadratic approximation for sparse inverse covariance estimation , 2014, J. Mach. Learn. Res..

[2]  Siu-Ming Yiu,et al.  MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning , 2014, BMC Genomics.

[3]  Gail L. Rosen,et al.  NBC: the Naïve Bayes Classification tool webserver for taxonomic classification of metagenomic reads , 2010, Bioinform..

[4]  J. Handelsman,et al.  Metagenomics: genomic analysis of microbial communities. , 2004, Annual review of genetics.

[5]  Maya Gokhale,et al.  Scalable metagenomic taxonomy classification using a reference genome database , 2013, Bioinform..

[6]  P. Bork,et al.  A human gut microbial gene catalogue established by metagenomic sequencing , 2010, Nature.

[7]  Jean-Philippe Vert,et al.  Large-scale machine learning for metagenomics sequence classification , 2015, Bioinform..

[8]  Alexander F. Auch,et al.  MEGAN analysis of metagenomic data. , 2007, Genome research.

[9]  Chongle Pan,et al.  Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance , 2014, Bioinform..

[10]  Derrick E. Wood,et al.  Kraken: ultrafast metagenomic sequence classification using exact alignments , 2014, Genome Biology.

[11]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[12]  S. Salzberg,et al.  Phymm and PhymmBL: Metagenomic Phylogenetic Classification with Interpolated Markov Models , 2009, Nature Methods.

[13]  Yu-Wei Wu,et al.  A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples , 2010, RECOMB.

[14]  M. Yuan,et al.  Model selection and estimation in the Gaussian graphical model , 2007 .

[15]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[16]  Siu-Ming Yiu,et al.  IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth , 2012, Bioinform..

[17]  Siu-Ming Yiu,et al.  MetaCluster: unsupervised binning of environmental genomic fragments and taxonomic annotation , 2010, BCB '10.

[18]  Anders F. Andersson,et al.  Binning metagenomic contigs by coverage and composition , 2014, Nature Methods.

[19]  Nanny Wermuth,et al.  Multivariate Dependencies: Models, Analysis and Interpretation , 1996 .