NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language

Background: Shotgun metagenomes contain a sample of all the genomic material in an environment, allowing for the characterization of a microbial community. In order to understand these communities, bioinformatics methods are crucial. A common first step in processing metagenomes is to compute abundance estimates of different taxonomic or functional groups from the raw sequencing data. Given the breadth of the field, computational solutions need to be flexible and extensible, enabling the combination of different tools into a larger pipeline.

Results: We present NGLess and NG-meta-profiler. NGLess is a domain-specific language for describing next-generation sequence processing pipelines. It was developed with the goal of enabling user-friendly computational reproducibility. It provides built-in support for many common operations on sequencing data and can be extended with external tools through configuration files. Using this framework, we developed NG-meta-profiler, a fast profiler for metagenomes which performs sequence preprocessing, mapping to bundled databases, filtering of the mapping results, and profiling (taxonomic and functional). It is significantly faster than either MOCAT2 or htseq-count and (as it builds on NGLess) its results are perfectly reproducible.

Conclusions: NG-meta-profiler is a high-performance solution for metagenomics processing built on NGLess. It can be used as-is to execute standard analyses or serve as the starting point for customization in a perfectly reproducible fashion. NGLess and NG-meta-profiler are open source software (under the liberal MIT license) and can be downloaded from https://ngless.embl.de or installed through bioconda.
