The metagenomic data life-cycle: standards and best practices

Abstract: Metagenomics data analyses from independent studies can only be compared if the analysis workflows are described in a harmonized way. In this overview, we map the landscape of data standards available for describing the essential steps in metagenomics: (i) material sampling, (ii) material sequencing, (iii) data analysis, and (iv) data archiving and publishing. Taking examples from marine research, we summarize the essential variables used to describe material sampling processes and sequencing procedures in a metagenomics experiment. These aspects of metagenomics dataset generation have been addressed to some extent by the scientific community, but greater awareness and adoption are still needed. We emphasize the lack of standards for reporting how metagenomics datasets are analysed and how the outputs of metagenomics data analysis should be archived and published. We propose best practice as a foundation for a community standard to enable reproducibility and better sharing of metagenomics datasets, leading ultimately to greater reuse and repurposing of metagenomics data.
