Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads

To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation critically relies on the assumption that short reads can be mapped to orthologous genes of similar function. Neither this assumption nor the various factors that affect short-read annotation have been systematically evaluated. To address this gap, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effects of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We also rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation and offer crucial goal-specific best-practice guidelines to inform future metagenomic research.
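
As an illustration of what "every possible read of a given length" means in practice, the following minimal sketch enumerates all contiguous substrings of a fixed length from a single sequence. It is not the pipeline used in the study; the function name `all_reads` and the toy sequence are hypothetical, and the sketch ignores sequencing error, reverse-complement strands, and gene boundaries.

```python
def all_reads(sequence: str, read_length: int) -> list[str]:
    """Return every contiguous read of `read_length` bases from `sequence`.

    Hypothetical helper for illustration only: a sequence of length n
    yields n - read_length + 1 reads, one per start position.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    # A sequence shorter than the read length produces an empty range,
    # and therefore no reads.
    last_start = len(sequence) - read_length
    return [sequence[i:i + read_length] for i in range(last_start + 1)]


if __name__ == "__main__":
    toy_gene = "ACGTAC"  # toy sequence, not real data
    for read in all_reads(toy_gene, 4):
        print(read)
```

Because the number of reads grows with both sequence length and the number of read lengths tested, exhaustive enumeration of this kind over many genomes would plausibly account for a simulated data set on the scale the abstract describes.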
