gcMeta: a Global Catalogue of Metagenomics platform to support the archiving, standardization and analysis of microbiome data

Abstract Meta-omics approaches have been increasingly used to study the structure and function of the microbial communities. A variety of large-scale collaborative projects are being conducted to encompass samples from diverse environments and habitats. This change has resulted in enormous demands for long-term data maintenance and capacity for data analysis. The Global Catalogue of Metagenomics (gcMeta) is a part of the ‘Chinese Academy of Sciences Initiative of Microbiome (CAS-CMI)’, which focuses on studying the human and environmental microbiome, establishing depositories of samples, strains and data, as well as promoting international collaboration. To accommodate and rationally organize massive datasets derived from several thousands of human and environmental microbiome samples, gcMeta features a database management system for archiving and publishing data in a standardized way. Another main feature is the integration of more than ninety web-based data analysis tools and workflows through a Docker platform which enables data analysis by using various operating systems. This platform has been rapidly expanding, and now hosts data from the CAS-CMI and a number of other ongoing research projects. In conclusion, this platform presents a powerful and user-friendly service to support worldwide collaborative efforts in the field of meta-omics research. This platform is freely accessible at https://gcmeta.wdcm.org/.

[1]  Brian D. Ondov,et al.  Mash: fast genome and metagenome distance estimation using MinHash , 2015, Genome Biology.

[2]  Jan Krüger,et al.  Analyzing large scale genomic data on the cloud with Sparkhit , 2017, Bioinform..

[3]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[4]  Chao Xie,et al.  Fast and sensitive protein alignment using DIAMOND , 2014, Nature Methods.

[5]  Steven L Salzberg,et al.  HISAT: a fast spliced aligner with low memory requirements , 2015, Nature Methods.

[6]  Andreas Wilke,et al.  The MG-RAST metagenomics database and portal in 2015 , 2015, Nucleic Acids Res..

[7]  Renzo Kottmann,et al.  Marine microbial biodiversity, bioinformatics and biotechnology (M2B3) data reporting and service standards , 2015, Standards in genomic sciences.

[8]  M. Berriman,et al.  REAPR: a universal tool for genome assembly evaluation , 2013, Genome Biology.

[9]  Yang Young Lu,et al.  VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data , 2017, Microbiome.

[10]  Anders F. Andersson,et al.  Binning metagenomic contigs by coverage and composition , 2014, Nature Methods.

[11]  Arthur Brady,et al.  Strains, functions and dynamics in the expanded Human Microbiome Project , 2017, Nature.

[12]  Yu-Wei Wu,et al.  A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-Tuples , 2010, RECOMB.

[13]  Sergey I. Nikolenko,et al.  SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing , 2012, J. Comput. Biol..

[14]  Davide Albanese,et al.  Strain profiling and epidemiology of bacterial species from metagenomic sequencing , 2017, Nature Communications.

[15]  Ryan M. Layer,et al.  LUMPY: a probabilistic framework for structural variant discovery , 2012, Genome Biology.

[16]  Peter F. Hallin,et al.  RNAmmer: consistent and rapid annotation of ribosomal RNA genes , 2007, Nucleic acids research.

[17]  Alexandros Stamatakis,et al.  Metagenomic species profiling using universal phylogenetic marker genes , 2013, Nature Methods.

[18]  Wing-Kin Sung,et al.  Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences , 2011, RECOMB.

[19]  S. Salzberg,et al.  Versatile and open software for comparing large genomes , 2004, Genome Biology.

[20]  Jie Ren,et al.  Alignment-free \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} }{}$d_2^*$\end{document} oligonucleotide frequency dissi , 2016, Nucleic acids research.

[21]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[22]  Mark Borodovsky,et al.  GENMARK: Parallel Gene Recognition for Both DNA Strands , 1993, Comput. Chem..

[23]  Yasukazu Nakamura,et al.  DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication , 2017, Bioinform..

[24]  Bernhard Y. Renard,et al.  DUDes: a top-down taxonomic profiler for metagenomics , 2016, Bioinform..

[25]  Xun Xu,et al.  SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads , 2013, Bioinform..

[26]  Hideaki Tanaka,et al.  MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads , 2011, BCB '11.

[27]  Jing Zhao,et al.  Bioinformatics tools for quantitative and functional metagenome and metatranscriptome data analysis in microbes , 2017, Briefings Bioinform..

[28]  C. Huttenhower,et al.  Metagenomic biomarker discovery and explanation , 2011, Genome Biology.

[29]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[30]  X. Huang,et al.  CAP3: A DNA sequence assembly program. , 1999, Genome research.

[31]  Carl Boettiger,et al.  An introduction to Docker for reproducible research , 2014, OPSR.

[32]  Yongchao Liu,et al.  Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data , 2013, Bioinform..

[33]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[34]  Alexey A. Gurevich,et al.  QUAST: quality assessment tool for genome assemblies , 2013, Bioinform..

[35]  Zhen Yue,et al.  pIRS: Profile-based Illumina pair-end reads simulator , 2012, Bioinform..

[36]  Connor T. Skennerton,et al.  CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes , 2015, Genome research.

[37]  William A. Walters,et al.  QIIME allows analysis of high-throughput community sequencing data , 2010, Nature Methods.

[38]  Martin Vingron,et al.  Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels , 2012, Bioinform..

[39]  Alejandro A. Schäffer,et al.  A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences , 2006, J. Comput. Biol..

[40]  Luiz Irber,et al.  sourmash: a library for MinHash sketching of DNA , 2016, J. Open Source Softw..

[41]  S. Salzberg,et al.  Centrifuge: rapid and sensitive classification of metagenomic sequences , 2016, bioRxiv.

[42]  Patricia P. Chan,et al.  tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes , 2016, Nucleic Acids Res..

[43]  Jianhui Li,et al.  World data centre for microorganisms: an information infrastructure to explore and utilize preserved microbial strains worldwide , 2016, Nucleic Acids Res..

[44]  S. Salzberg,et al.  StringTie enables improved reconstruction of a transcriptome from RNA-seq reads , 2015, Nature Biotechnology.

[45]  Kunihiko Sadakane,et al.  MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph , 2014, Bioinform..

[46]  Evgeny M. Zdobnov,et al.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs , 2015, Bioinform..

[47]  Miriam L. Land,et al.  Trace: Tennessee Research and Creative Exchange Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification Recommended Citation Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification , 2022 .

[48]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.

[49]  Torsten Seemann,et al.  Prokka: rapid prokaryotic genome annotation , 2014, Bioinform..

[50]  Steven Salzberg,et al.  BIOINFORMATICS ORIGINAL PAPER , 2004 .

[51]  Dongwan D. Kang,et al.  MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities , 2015, PeerJ.

[52]  Bernard Henrissat,et al.  Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome , 2012, PLoS Comput. Biol..

[53]  Su Yao,et al.  The global catalogue of microorganisms 10K type strain sequencing project: closing the genomic gaps for the validly published prokaryotic and fungi species , 2018, GigaScience.

[54]  Hideaki Sugawara,et al.  The Sequence Read Archive , 2010, Nucleic Acids Res..

[55]  Marcel Martin Cutadapt removes adapter sequences from high-throughput sequencing reads , 2011 .

[56]  W. J. Kent,et al.  BLAT--the BLAST-like alignment tool. , 2002, Genome research.

[57]  I-Min A. Chen,et al.  IMG/M: integrated genome and metagenome comparative data analysis system , 2016, Nucleic Acids Res..

[58]  Robert Patro,et al.  Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms , 2013, ArXiv.

[59]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[60]  Siu-Ming Yiu,et al.  Meta-IDBA: a de Novo assembler for metagenomic data , 2011, Bioinform..

[61]  Graziano Pesole,et al.  The metagenomic data life-cycle: standards and best practices , 2017, GigaScience.

[62]  S. Salzberg,et al.  Improved microbial gene identification with GLIMMER. , 1999, Nucleic acids research.

[63]  Cole Trapnell,et al.  TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions , 2013, Genome Biology.

[64]  Robert C. Edgar,et al.  PILER-CR: Fast and accurate identification of CRISPR repeats , 2007, BMC Bioinformatics.

[65]  Walter Pirovano,et al.  BIOINFORMATICS APPLICATIONS , 2022 .

[66]  Haixu Tang,et al.  FragGeneScan: predicting genes in short and error-prone reads , 2010, Nucleic acids research.

[67]  Emily S. Charlson,et al.  Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications , 2011, Nature Biotechnology.

[68]  Anders Krogh,et al.  Fast and sensitive taxonomic classification for metagenomics with Kaiju , 2016, Nature Communications.

[69]  Sarah A. Butcher,et al.  k-SLAM: accurate and ultra-fast taxonomic classification and gene identification for large metagenomic data sets , 2016, Nucleic acids research.

[70]  J. Chun,et al.  OrthoANI: An improved algorithm and software for calculating average nucleotide identity. , 2016, International journal of systematic and evolutionary microbiology.

[71]  N. Friedman,et al.  Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data , 2011, Nature Biotechnology.

[72]  Rick L. Stevens,et al.  A communal catalogue reveals Earth’s multiscale microbial diversity , 2017, Nature.

[73]  Leena Salmela,et al.  LoRDEC: accurate and efficient long read error correction , 2014, Bioinform..

[74]  Jesse R. Zaneveld,et al.  Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences , 2013, Nature Biotechnology.

[75]  Rolf Apweiler,et al.  InterProScan - an integration platform for the signature-recognition methods in InterPro , 2001, Bioinform..

[76]  F. Raymond,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Ray Meta: scalable de novo metagenome assembly and profiling , 2012 .

[77]  Fangqing Zhao,et al.  Composition-based classification of short metagenomic sequences elucidates the landscapes of taxonomic and functional enrichment of microorganisms , 2012, Nucleic acids research.

[78]  A. Gnirke,et al.  High-quality draft assemblies of mammalian genomes from massively parallel sequence data , 2010, Proceedings of the National Academy of Sciences.

[79]  Ning Ma,et al.  BLAST+: architecture and applications , 2009, BMC Bioinformatics.

[80]  Robert D. Finn,et al.  EBI Metagenomics in 2017: enriching the analysis of microbial communities, from sequence reads to assemblies , 2017, Nucleic Acids Res..

[81]  Björn Usadel,et al.  Trimmomatic: a flexible trimmer for Illumina sequence data , 2014, Bioinform..

[82]  David R. Kelley,et al.  Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks , 2012, Nature Protocols.

[83]  C. Huttenhower,et al.  Metagenomic microbial community profiling using unique clade-specific marker genes , 2012, Nature Methods.

[84]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[85]  P. Pevzner,et al.  metaSPAdes: a new versatile metagenomic assembler. , 2017, Genome research.

[86]  Chris F. Taylor,et al.  The minimum information about a genome sequence (MIGS) specification , 2008, Nature Biotechnology.

[87]  Nicole Dubilier,et al.  Microbiology: Create a global microbiome effort , 2015, Nature.

[88]  Chris Mungall,et al.  The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation , 2016, Journal of Biomedical Semantics.

[89]  Georgios A. Pavlopoulos,et al.  Metagenomics: Tools and Insights for Analyzing Next-Generation Sequencing Data Derived from Biodiversity Studies , 2015, Bioinformatics and biology insights.

[90]  G. Cochrane,et al.  The Genomic Standards Consortium , 2011, PLoS biology.

[91]  Lior Pachter,et al.  Erratum: Near-optimal probabilistic RNA-seq quantification , 2016, Nature Biotechnology.

[92]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[93]  Natalia N. Ivanova,et al.  1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life , 2017, Nature Biotechnology.

[94]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[95]  N. Friedman,et al.  Trinity : reconstructing a full-length transcriptome without a genome from RNA-Seq data , 2016 .

[96]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[97]  Thomas Hackl,et al.  proovread: large-scale high-accuracy PacBio correction through iterative short read consensus , 2014, Bioinform..

[98]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[99]  Natalia N. Ivanova,et al.  Microbiome Data Science: Understanding Our Microbial Planet. , 2016, Trends in microbiology.

[100]  Huanming Yang,et al.  De novo assembly of human genomes with massively parallel short read sequencing. , 2010, Genome research.

[101]  Aaron M. Newman,et al.  XSTREAM: A practical algorithm for identification and architecture modeling of tandem repeats in protein sequences , 2007, BMC Bioinformatics.

[102]  Yadong Wang,et al.  PRISM: Pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants , 2012, Bioinform..

[103]  Leping Li,et al.  ART: a next-generation sequencing read simulator , 2012, Bioinform..

[104]  Alexey A. Gurevich,et al.  MetaQUAST: evaluation of metagenome assemblies , 2016, Bioinform..

[105]  Blake A. Simmons,et al.  MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets , 2016, Bioinform..

[106]  Jian Wang,et al.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler , 2012, GigaScience.

[107]  Glenn Tesler,et al.  Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory , 2012, BMC Bioinformatics.

[108]  Alyssa C. Frazee,et al.  Ballgown bridges the gap between transcriptome assembly and expression analysis , 2015, Nature Biotechnology.

[109]  S. Koren,et al.  Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation , 2016, bioRxiv.

[110]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[111]  Andreas Wilke,et al.  A Platform-Independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE , 2012, PLoS Comput. Biol..

[112]  Richard Durbin,et al.  Fast and accurate long-read alignment with Burrows–Wheeler transform , 2010, Bioinform..

[113]  Lior Pachter,et al.  Near-optimal probabilistic RNA-seq quantification , 2016, Nature Biotechnology.