A Bioinformatician's Guide to Metagenomics

SUMMARY As random shotgun metagenomic projects proliferate and become the dominant source of publicly available sequence data, best practices for their execution and analysis become increasingly important. Based on our experience at the Joint Genome Institute, we describe, step by step, the chain of decisions accompanying a metagenomic project from the viewpoint of bioinformatic analysis. We guide the reader through a standard workflow for a metagenomic project, beginning with presequencing considerations, such as community composition and sequence data type, that greatly influence downstream analyses. We proceed with recommendations for sampling and data generation, including sample and metadata collection, community profiling, construction of shotgun libraries, and sequencing strategies. We then discuss how generic sequence processing steps (read preprocessing, assembly, and gene prediction and annotation) apply to metagenomic data sets, in contrast to genome projects. We then present types of data analysis particular to metagenomes, including binning, dominant population analysis, and gene-centric analysis. Finally, we discuss data management issues. We hope that this review will help bioinformaticians and biologists make better-informed decisions over the course of a metagenomic project.
