A critical assessment of gene catalogs for metagenomic analysis

Abstract

Motivation: Microbial gene catalogs are data structures that organize genes found in microbial communities, providing a reference for standardized analysis of microbes across samples and studies. Although gene catalogs are commonly used, they have not been critically evaluated for their effectiveness as a basis for metagenomic analyses.

Results: As a case study, we investigate one such catalog, the Integrated Gene Catalog (IGC); however, our observations apply broadly to most gene catalogs constructed to date. We focus on both the approach used to construct this catalog and its effectiveness when used as a reference for microbiome studies. Our results highlight important limitations of the approach used to construct the IGC and call into question the broad usefulness of gene catalogs more generally. We also recommend best practices for the construction and use of gene catalogs in microbiome studies and highlight opportunities for future research.

Availability and implementation: All supporting scripts for our analyses can be found on GitHub: https://github.com/SethCommichaux/IGC.git. The supporting data can be downloaded from: https://obj.umiacs.umd.edu/igc-analysis/IGC_analysis_data.tar.gz.

Supplementary information: Supplementary data are available at Bioinformatics online.
