Functional assignment of metagenomic data: challenges and applications

Metagenomic sequencing provides a unique opportunity to explore Earth’s limitless environments, which harbor scores of as yet unknown and mostly unculturable microbes and other organisms. Functional analysis of metagenomic data plays a central role in projects that aim to explore the most essential questions in microbiology, namely ‘In a given environment, among the microbes present, what are they doing, and how are they doing it?’ Toward this goal, several large-scale metagenomic projects have recently been conducted or are currently underway. The main difficulty in functional analysis of metagenomic data is the vast amount of data generated in these projects, whose sheer volume demands substantial computational time and storage space. These problems are compounded by other factors that can affect the functional analysis, including sample preparation, sequencing method and average genome size of the metagenomic samples. In addition, the read lengths generated during sequencing influence sequence assembly, gene prediction and, subsequently, the functional analysis. The level of confidence in functional predictions increases with read length. Usually, the most reliable functional annotations for metagenomic sequences are obtained using homology-based approaches against publicly available reference sequence databases. Here, we present an overview of the current state of functional analysis of metagenomic sequence data, the bottlenecks frequently encountered and possible solutions in light of currently available resources and tools. Finally, we provide some examples of applications from recent metagenomic studies that have been conducted successfully in spite of the known difficulties.
