MTR: taxonomic annotation of short metagenomic reads using clustering at multiple taxonomic ranks

Motivation: Metagenomics is a recent field of biology that studies microbial communities by analyzing genomic content sequenced directly from the environment. A metagenomic dataset consists of many short DNA or RNA fragments called reads. An important problem in metagenomic data analysis is determining the taxonomic composition of a given dataset. A simple method for this task, the Lowest Common Ancestor (LCA), is employed in state-of-the-art computational tools for analyzing metagenomes of very short reads (about 100 bp). However, LCA has two main drawbacks: it tends to assign many reads to high taxonomic ranks, and it discards a large number of reads.

Results: We present MTR, a new method that tackles these drawbacks using clustering at Multiple Taxonomic Ranks. Unlike LCA, which processes reads one by one, MTR exploits information shared by reads. MTR consists of two main phases. First, for each taxonomic rank, a collection of potential clusters of reads is generated, and each potential cluster is associated with a taxon at that rank. Next, a small number of clusters is selected at each rank using a combinatorial optimization algorithm. The effectiveness of the resulting method is tested on a large number of simulated and real metagenomes. The experiments show that MTR improves on LCA by discarding significantly fewer reads and by assigning many more reads to lower taxonomic ranks. Moreover, MTR provides a more faithful taxonomic characterization of the metagenome population distribution.

Availability: MATLAB and C++ source code for the method is available at http://cs.ru.nl/~gori/software/MTR.tar.gz.

Contact: gori@cs.ru.nl; elenam@cs.ru.nl

Supplementary information: Supplementary data are available at Bioinformatics online.
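To make the LCA baseline described under Motivation concrete, the following is a minimal Python sketch of the per-read LCA rule. It is an illustration under stated assumptions, not the implementation used by MTR or by any particular tool: the function name `lca_assign`, the representation of each database hit as a root-to-species list of taxon names, and the example lineages are all hypothetical choices made here.

```python
from typing import List, Optional, Sequence

# Assumed rank order for illustration; real taxonomies may have more levels.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lca_assign(hit_lineages: Sequence[Sequence[str]]) -> Optional[List[str]]:
    """Return the longest lineage prefix shared by all hits of one read.

    Returns None when the read has no hits (the read is discarded).
    An empty list means the hits agree on nothing below the root.
    """
    if not hit_lineages:
        return None  # no significant hit: LCA discards the read
    common: List[str] = []
    # zip(*...) walks the lineages rank by rank, from the top rank downwards.
    for taxa_at_rank in zip(*hit_lineages):
        if len(set(taxa_at_rank)) != 1:
            break  # hits disagree here, so the assignment stops one rank above
        common.append(taxa_at_rank[0])
    return common


# Hypothetical read with two hits that agree only down to the family level.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Salmonella", "Salmonella enterica"],
]
assignment = lca_assign(hits)
assigned_rank = RANKS[len(assignment) - 1]
```

In this sketch the read is assigned at the family rank even though both hits are resolved to species, which illustrates the first drawback named above: a single disagreeing hit pushes the assignment to a higher rank. A read with no hits returns `None`, corresponding to the second drawback of discarded reads.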
