DOE JGI Metagenome Workflow

The DOE JGI Metagenome Workflow performs metagenome data processing, including assembly; structural, functional, and taxonomic annotation; and binning of metagenomic datasets that are subsequently included in the Integrated Microbial Genomes and Microbiomes (IMG/M) comparative analysis system (I. Chen, K. Chu, K. Palaniappan, M. Pillay, A. Ratner, J. Huang, M. Huntemann, N. Varghese, J. White, R. Seshadri, et al., Nucleic Acids Research, 2019) and provided for download via the Joint Genome Institute (JGI) Data Portal (https://genome.jgi.doe.gov/portal/). This workflow scales to run on thousands of metagenome samples per year, which can vary in microbial community complexity and sequencing depth. Here we describe the tools, databases, and parameters used at each step of the workflow, both to help with interpretation of metagenome data available in IMG and to enable researchers to apply this workflow to their own data. We use 20 publicly available sediment metagenomes to illustrate the computing requirements for the different steps and to highlight the typical results of data processing. The workflow modules for read filtering and metagenome assembly are available as a Workflow Description Language (WDL) file (https://code.jgi.doe.gov/BFoster/jgi_meta_wdl.git). The workflow modules for annotation and binning are provided as a service to the user community at https://img.jgi.doe.gov/submit and require filling out the project and associated metadata descriptions in the Genomes OnLine Database (GOLD) (S. Mukherjee, D. Stamatis, J. Bertsch, G. Ovchinnikova, H. Katta, A. Mojica, I. Chen, N. Kyrpides, and T. Reddy, Nucleic Acids Research, 2018).

IMPORTANCE The DOE JGI Metagenome Workflow is designed for processing metagenomic datasets starting from Illumina fastq files. It performs data preprocessing, error correction, assembly, structural and functional annotation, and binning. The results of processing are provided in several standard formats, such as fasta and gff, and can be used for subsequent integration into the Integrated Microbial Genomes (IMG) system, where they can be compared to a comprehensive set of publicly available metagenomes. As of 30 July 2020, 7,155 JGI metagenomes have been processed by the JGI Metagenome Workflow.
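Because the annotation results are distributed as gff files, a first look at a processed dataset often amounts to tallying the predicted feature types. The following is a minimal sketch of that idea, assuming the conventional nine-column, tab-delimited GFF layout; the text above does not specify the exact columns, feature types, or attributes the workflow emits, so the function name `count_gff_feature_types`, the contig names, and the sample records are all hypothetical.

```python
from collections import Counter
from io import StringIO


def count_gff_feature_types(handle):
    """Count features per type (third column) in a GFF-style text stream.

    Assumes tab-delimited records with nine columns; comment lines,
    blank lines, and records with fewer columns are skipped.
    """
    counts = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Hypothetical records for illustration only, not actual workflow output.
example = StringIO(
    "##gff-version 3\n"
    "contig_1\texample\tCDS\t10\t400\t.\t+\t0\tID=feature_1\n"
    "contig_1\texample\ttRNA\t450\t520\t.\t-\t.\tID=feature_2\n"
    "contig_2\texample\tCDS\t5\t310\t.\t+\t0\tID=feature_3\n"
)

feature_counts = count_gff_feature_types(example)
print(dict(feature_counts))
```

To run this on a real file, pass an open file handle (for example `open(path)`, with `path` pointing at a downloaded gff file) in place of the in-memory `StringIO` sample.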
