MG-RAST version 4 - lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis

As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service that constantly runs large-volume scientific workflows, MG-RAST is well placed to perform benchmarking and to implement algorithmic or platform improvements, which in many cases involve trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1-3] is an example; we use existing well-studied data sets as gold standards, representing different environments and different technologies, to evaluate any changes to the pipeline. These well-understood data sets in MG-RAST currently serve as our platform for benchmarking. Artificial data sets have not added value for pipeline performance optimization, because they do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improving the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners. These versions will enable double barcoding, support stronger inferences from longer-read technologies, and increase throughput while maintaining sensitivity by using DIAMOND and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard for specifying bioinformatics workflows, to facilitate both the development and the efficient high-performance implementation of the community's data analysis tasks.
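The trade-off between sensitivity, specificity and run-time cost described above can be made concrete with a small sketch. This is not MG-RAST code: the function name `evaluate_against_gold_standard`, the `BenchmarkResult` type, and the representation of a gold standard as a mapping from read identifier to a true/false annotation are all assumptions made for illustration. It only shows how one candidate pipeline configuration might be scored against a well-studied data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """Scores for one (hypothetical) pipeline configuration."""
    sensitivity: float
    specificity: float
    runtime_seconds: float


def evaluate_against_gold_standard(gold, predicted, runtime_seconds):
    """Compare predicted per-read calls with gold-standard labels.

    gold:      dict mapping read id -> True/False (assumed ground truth)
    predicted: dict mapping read id -> True/False (pipeline output);
               reads missing from `predicted` count as negative calls.
    """
    tp = fp = tn = fn = 0
    for read_id, truth in gold.items():
        call = predicted.get(read_id, False)
        if truth and call:
            tp += 1
        elif truth and not call:
            fn += 1
        elif not truth and call:
            fp += 1
        else:
            tn += 1

    # Guard against empty classes so the ratios are always defined.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return BenchmarkResult(sensitivity, specificity, runtime_seconds)


if __name__ == "__main__":
    # Toy labels, invented purely for this example.
    gold = {"r1": True, "r2": True, "r3": False, "r4": False}
    predicted = {"r1": True, "r2": False, "r3": False, "r4": True}
    print(evaluate_against_gold_standard(gold, predicted, runtime_seconds=12.5))
```

Scoring two configurations on the same gold-standard data set in this way puts a loss in sensitivity side by side with the corresponding gain in run time, which is the kind of comparison the benchmarking described above relies on.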
