Paper information - Accurate annotation of metagenomic data without species-level references

Taxonomic annotation is a critical first step in the analysis of metagenomic data. Although many tools have been developed, their accuracy is still unsatisfactory, particularly when no close species-level reference exists in the database. In this paper, we propose a novel annotation tool, MetaAnnotator, for annotating metagenomic reads; it significantly outperforms all existing tools when only genus-level references exist in the database. In our experiments, MetaAnnotator assigns 87.5% of reads correctly (67.5% of reads are assigned to the exact genus), with only 8.5% of reads wrongly assigned. The best existing tool, MetaCluster-TA, achieves only 73.4% correct read assignment (with only 50.9% of reads assigned to the exact genus and 22.6% of reads wrongly assigned). MetaAnnotator is also the second fastest tool (1 hour for 20 million reads). The core concepts behind MetaAnnotator are: (i) we consider only exact k-mers in the coding regions of the references, as they should be more significant and accurate; (ii) to assign reads to taxonomy nodes, we construct genome- and taxonomy-specific probabilistic models from the reference database; and (iii) we use the BWT data structure to speed up the k-mer matching process.
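The abstract does not describe how MetaAnnotator implements its BWT-based matching, so the following is only a minimal sketch of the general technique named in point (iii): counting exact k-mer occurrences with backward search over a Burrows-Wheeler transform (an FM-index). The function names, the naive suffix-array construction, and the toy reference string are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a toy FM-index for `reference` (hypothetical helper, demo only)."""
    text = reference + "$"  # sentinel, sorts before A, C, G, T
    # Naive suffix array: fine for a demo, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" at i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in text strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return {"first": first, "occ": occ, "n": len(bwt)}


def count_exact(kmer, index):
    """Count exact occurrences of `kmer` using backward search."""
    first, occ = index["first"], index["occ"]
    lo, hi = 0, index["n"]  # half-open range of suffix-array rows
    for ch in reversed(kmer):  # extend the match one character to the left
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:  # no suffix starts with the current pattern
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("ACGTACGTTAGC")  # toy "coding region"
    for kmer in ("ACGT", "CGTT", "GGG"):
        print(kmer, count_exact(kmer, index))
```

Each query costs time proportional to the k-mer length, independent of the reference size, which is the usual reason a BWT index is chosen for this kind of exact matching. A production index would use a linear-time suffix-array construction and a sampled occurrence table rather than the full table built here.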
