Analysis and comparison of very large metagenomes with fast clustering and functional annotation

BackgroundThe remarkable advance of metagenomics presents significant new challenges in data analysis. Metagenomic datasets (metagenomes) are large collections of sequencing reads from anonymous species within particular environments. Computational analyses for very large metagenomes are extremely time-consuming, and there are often many novel sequences in these metagenomes that are not fully utilized. The number of available metagenomes is rapidly increasing, so fast and efficient metagenome comparison methods are in great demand.ResultsThe new metagenomic data analysis method Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) was developed using an ultra-fast sequence clustering algorithm, fast protein family annotation tools, and a novel statistical metagenome comparison method that employs a unique graphic interface. RAMMCAP processes extremely large datasets with only moderate computational effort. It identifies raw read clusters and protein clusters that may include novel gene families, and compares metagenomes using clusters or functional annotations calculated by RAMMCAP. In this study, RAMMCAP was applied to the two largest available metagenomic collections, the "Global Ocean Sampling" and the "Metagenomic Profiling of Nine Biomes".ConclusionRAMMCAP is a very fast method that can cluster and annotate one million metagenomic reads in only hundreds of CPU hours. It is available from http://tools.camera.calit2.net/camera/rammcap/.

[1]  Rick L. Stevens,et al.  The RAST Server: Rapid Annotations using Subsystems Technology , 2008, BMC Genomics.

[2]  Steve Johnson,et al.  Improved HMMERHEAD for Better Sensitivity , 2007 .

[3]  J. Banfield,et al.  Community structure and metabolism through reconstruction of microbial genomes from the environment , 2004, Nature.

[4]  Frank Oliver Glöckner,et al.  TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences , 2004, BMC Bioinformatics.

[5]  S. Kravitz,et al.  CAMERA: A Community Resource for Metagenomics , 2007, PLoS biology.

[6]  Florent E. Angly,et al.  The Marine Viromes of Four Oceanic Regions , 2006, PLoS biology.

[7]  M. Pop,et al.  Metagenomic Analysis of the Human Distal Gut Microbiome , 2006, Science.

[8]  Adam Godzik,et al.  Clustering of highly homologous sequences to reduce the size of large protein databases , 2001, Bioinform..

[9]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[10]  Rick L. Stevens,et al.  Functional metagenomic profiling of nine biomes , 2008, Nature.

[11]  Hanlee P. Ji,et al.  Next-generation DNA sequencing , 2008, Nature Biotechnology.

[12]  T. Itoh,et al.  MetaGeneAnnotator: Detecting Species-Specific Patterns of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and Phage Genomes , 2008, DNA research : an international journal for rapid publication of reports on genes and genomes.

[13]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[14]  S. Tringe,et al.  Comparative Metagenomics of Microbial Communities , 2004, Science.

[15]  Forest Rohwer,et al.  An application of statistics to comparative metagenomics , 2006, BMC Bioinformatics.

[16]  T. Takagi,et al.  MetaGene: prokaryotic gene finding from environmental genome shotgun sequences , 2006, Nucleic acids research.

[17]  Jo Handelsman,et al.  A statistical toolbox for metagenomics: assessing functional diversity in microbial communities , 2008, BMC Bioinformatics.

[18]  Daniel H. Huson,et al.  MetaSim—A Sequencing Simulator for Genomics and Metagenomics , 2008, PloS one.

[19]  I. Rigoutsos,et al.  Accurate phylogenetic classification of variable-length DNA fragments , 2007, Nature Methods.

[20]  Gabor T. Marth,et al.  Pyrobayes: an improved base caller for SNP discovery in pyrosequences , 2008, Nature Methods.

[21]  Adam Godzik,et al.  Tolerating some redundancy significantly speeds up clustering of large protein databases , 2002, Bioinform..

[22]  Katharina J. Hoff,et al.  BMC Bioinformatics BioMed Central Methodology article Gene prediction in metagenomic fragments: A large scale machine , 2008 .

[23]  A. Salamov,et al.  Use of simulated data sets to evaluate the fidelity of metagenomic processing methods , 2007, Nature Methods.

[24]  Shibu Yooseph,et al.  Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering , 2007, BMC Bioinformatics.

[25]  D. Sheskin Handbook of Parametric and Nonparametric Statistical Procedures: Third Edition , 2000 .

[26]  Alexander F. Auch,et al.  MEGAN analysis of metagenomic data. , 2007, Genome research.

[27]  O. White,et al.  Environmental Genome Shotgun Sequencing of the Sargasso Sea , 2004, Science.

[28]  David J. Sheskin,et al.  Handbook of Parametric and Nonparametric Statistical Procedures , 1997 .

[29]  Benjamin J. Raphael,et al.  The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families , 2007, PLoS biology.

[30]  R. Amann,et al.  Application of tetranucleotide frequencies for the assignment of genomic fragments. , 2004, Environmental microbiology.

[31]  A. Godzik,et al.  Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets , 2008, PloS one.

[32]  Naryttza N. Diaz,et al.  Phylogenetic classification of short environmental DNA fragments , 2008, Nucleic acids research.

[33]  A. Halpern,et al.  The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific , 2007, PLoS biology.

[34]  E. Delong,et al.  Community Genomics Among Stratified Microbial Assemblages in the Ocean's Interior , 2006, Science.

[35]  I-Min A. Chen,et al.  IMG/M: a data management and analysis system for metagenomes , 2007, Nucleic Acids Res..