Metagenomics: Facts and Artifacts, and Computational Challenges

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. By enabling the analysis of populations that include many so-far unculturable and often unknown microbes, metagenomics is revolutionizing microbiology and has excited researchers in many disciplines that could benefit from the study of environmental microbes, including ecology, environmental sciences, and biomedicine. Specific computational and statistical tools have been developed for metagenomic data analysis and comparison. Recent studies, however, have revealed various kinds of artifacts in metagenomic data, caused by limitations in the experimental protocols and/or inadequate data analysis procedures, which often lead to incorrect conclusions about a microbial community. Here, we review some of these artifacts, such as overestimation of species diversity and incorrect estimation of gene family frequencies, and discuss emerging computational approaches to address them. We also review challenges that metagenomics may encounter with the extensive application of next-generation sequencing (NGS) techniques.
