Microbial natural product databases: moving forward in the multi-omics era.

Covering: 2010-2020

The digital revolution is driving significant changes in how people store, distribute, and use information. With the advent of new technologies around linked data, machine learning and large-scale network inference, the natural products research field is beginning to embrace real-time sharing and large-scale analysis of digitized experimental data. Databases play a key role in this, as they allow systematic annotation and storage of data for both basic and advanced applications. The quality of the content, structure, and accessibility of these databases all contribute to their usefulness for the scientific community in practice. This review covers the development of databases relevant for microbial natural product discovery during the past decade (2010-2020), including repositories of chemical structures/properties, metabolomics, and genomic data (biosynthetic gene clusters). It provides an overview of the most important databases and their functionalities, highlights some early meta-analyses using such databases, and discusses basic principles to enable widespread interoperability between databases. Furthermore, it points out conceptual and practical challenges in the curation and usage of natural products databases. Finally, the review closes with a discussion of key action points required for the field moving forward, not only for database developers but for any scientist active in the field.
