A Statistical Framework for Accurate Taxonomic Assignment of Metagenomic Sequencing Reads

The advent of next-generation sequencing technologies has greatly advanced the field of metagenomics, which studies genetic material recovered directly from an environment. Characterizing the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. The multiple genomes contained in a metagenomic sample can be identified and quantified through homology searches of sequence reads against known sequences catalogued in reference databases. Traditionally, reads with hits to multiple genomes are assigned to non-specific or high ranks of the taxonomy tree, which compromises accurate estimation of the relative abundance of the genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree, as many existing methods do, we propose a statistical framework that models the candidate genomes to which sequence reads have hits. After the proportion of reads generated by each genome is estimated, sequence reads are assigned to the candidate genomes and to the taxonomy tree according to an estimated probability that takes into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on simulated datasets and two real datasets. It assigns reads to the low taxonomic ranks very accurately. Our statistical approach to taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
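
The two-stage idea described above (first estimate the proportion of reads generated by each candidate genome, then assign each read using both its alignment evidence and those proportions) can be illustrated with a small mixture-model sketch. This is not the TAMER implementation, which is written in R; it is a minimal Python illustration under stated assumptions. The function names `estimate_abundance` and `assign_reads` are hypothetical, the per-read weights are assumed to be non-negative likelihood-like values already derived from alignment scores (0 where a read has no hit to a genome), and the EM update shown is the generic one for mixture proportions.

```python
def estimate_abundance(likelihoods, n_iter=200, tol=1e-10):
    """Estimate genome proportions with a generic EM for mixture weights.

    likelihoods: one row per read, one column per candidate genome.
    Each entry is an assumed likelihood-like weight derived from the
    alignment score (0 means the read has no hit to that genome).
    """
    n_genomes = len(likelihoods[0])
    props = [1.0 / n_genomes] * n_genomes  # start from equal abundance
    for _ in range(n_iter):
        totals = [0.0] * n_genomes
        # E-step: split each read across genomes in proportion to
        # (current abundance) x (alignment weight).
        for row in likelihoods:
            weighted = [p * w for p, w in zip(props, row)]
            norm = sum(weighted)
            if norm == 0:
                continue  # read with no usable hit contributes nothing
            for j, value in enumerate(weighted):
                totals[j] += value / norm
        n_used = sum(totals)
        if n_used == 0:
            return props  # no read had any hit; keep the starting guess
        # M-step: new proportions are the average responsibilities.
        new_props = [t / n_used for t in totals]
        change = max(abs(a - b) for a, b in zip(new_props, props))
        props = new_props
        if change < tol:
            break
    return props


def assign_reads(likelihoods, props):
    """Assign each read to the genome with the highest posterior weight.

    Returns a list of genome indices, or None for reads without hits.
    """
    assignments = []
    for row in likelihoods:
        weighted = [p * w for p, w in zip(props, row)]
        if sum(weighted) == 0:
            assignments.append(None)
        else:
            best = max(range(len(weighted)), key=weighted.__getitem__)
            assignments.append(best)
    return assignments


if __name__ == "__main__":
    # Toy input: three reads hit only genome 0, one read hits both
    # genomes equally well, and one read hits only genome 1.
    toy = [[1, 0], [1, 0], [1, 0], [1, 1], [0, 1]]
    proportions = estimate_abundance(toy)
    print(proportions)
    print(assign_reads(toy, proportions))
```

In the toy input, the read that hits both genomes equally well is resolved in favor of the genome estimated to be more abundant, rather than being pushed up to a less specific taxonomic rank. Mapping the assigned genomes onto the taxonomy tree is outside the scope of this sketch.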
