Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS

Natural products (bio)synthesised by microbes are an important component of the pharmacopeia with a vast array of biomedical applications, in addition to their key role in many ecological interactions. One approach for the discovery of these metabolites is the identification of biosynthetic gene clusters (BGCs), genomic units which encode the molecular machinery required for producing the natural product. Genome mining has revolutionised the discovery of BGCs, yet metagenomic assemblies represent a largely untapped source of natural products. The imbalanced distribution of BGC classes in existing databases restricts the generalisation of detection patterns and limits the ability of mining methods to recognise a broader spectrum of BGCs. This problem is further intensified in metagenomic datasets, where BGC genes may be split across multiple contigs. This work presents SanntiS, a new machine learning-based approach for identifying BGCs. SanntiS achieved high precision and recall in both genomic and metagenomic datasets, effectively capturing a broad range of BGCs. Application of SanntiS to metagenomic assemblies found in MGnify led to a resource containing 1.1 million BGC predictions with associated contextual data from diverse biomes. Additionally, experimental validation of a previously undescribed BGC, detected solely by SanntiS, further demonstrates the potential of this approach in uncovering novel bioactive compounds. The study illustrates the significance of metagenomic datasets in comprehensively understanding the diversity and distribution of BGCs in microbial communities.

[22]  John A. Tallarico,et al.  Harnessing the Anti-Cancer Natural Product Nimbolide for Targeted Protein Degradation , 2018, bioRxiv.

[23]  Stanley B. Zdonik,et al.  Precision and Recall for Time Series , 2018, NeurIPS.

[24]  Zhiguo Yuan,et al.  Metagenomic analysis reveals wastewater treatment plants as hotspots of antibiotic resistance genes and mobile genetic elements. , 2017, Water research.

[25]  Michael A. Skinnider,et al.  PRISM 3: expanded prediction of natural product chemical structures from microbial genomes , 2017, Nucleic Acids Res..

[26]  Jeroen S. Dickschat,et al.  The Ecological Role of Volatile and Soluble Secondary Metabolites Produced by Soil Bacteria. , 2017, Trends in microbiology.

[27]  R. Edwards,et al.  Microcins mediate competition among Enterobacteriaceae in the inflamed gut , 2016, Nature.

[28]  M. Ranjan,et al.  Laterosporulin10: a novel defensin like Class IId bacteriocin from Brevibacillus sp. strain SKDU10 with inhibitory activity against microbial pathogens. , 2016, Microbiology.

[29]  Christopher J. Schwalen,et al.  A new genome-mining tool redefines the lasso peptide biosynthetic landscape , 2016, Nature chemical biology.

[30]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[31]  Annamaria Mesaros,et al.  Metrics for Polyphonic Sound Event Detection , 2016 .

[32]  Johannes Söding,et al.  MMseqs software suite for fast and deep clustering and searching of large protein sequence sets , 2016, Bioinform..

[33]  Neetika Nath,et al.  CASSIS and SMIPS: promoter-based prediction of secondary metabolite gene clusters in eukaryotic genomes , 2015, Bioinform..

[34]  Gisbert Schneider,et al.  Active-learning strategies in computer-assisted drug discovery. , 2015, Drug discovery today.

[35]  Jed A. Fuhrman,et al.  Marine microbial community dynamics and their ecological interpretation , 2015, Nature Reviews Microbiology.

[36]  S. Korpole,et al.  The intramolecular disulfide‐stapled structure of laterosporulin, a class IId bacteriocin, conceals a human defensin‐like structural module , 2015, The FEBS journal.

[37]  Peter Cimermancic,et al.  A Systematic Analysis of Biosynthetic Gene Clusters in the Human Microbiome Reveals a Common Family of Antibiotics , 2014, Cell.

[38]  Roger G. Linington,et al.  Insights into Secondary Metabolism from a Global Analysis of Prokaryotic Biosynthetic Gene Clusters , 2014, Cell.

[39]  Krystle L. Chavarria,et al.  Diversity and evolution of secondary metabolism in the marine actinomycete genus Salinispora , 2014, Proceedings of the National Academy of Sciences.

[40]  Matthew Fraser,et al.  InterProScan 5: genome-scale protein function classification , 2014, Bioinform..

[41]  Ashish,et al.  Identification, Purification and Characterization of Laterosporulin, a Novel Bacteriocin Produced by Brevibacillus sp. Strain GI-9 , 2012, PloS one.

[42]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[43]  Miriam L. Land,et al.  Trace: Tennessee Research and Creative Exchange Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification Recommended Citation Prodigal: Prokaryotic Gene Recognition and Translation Initiation Site Identification , 2022 .

[44]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[45]  P. Shannon,et al.  Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks , 2003 .

[46]  M. R. Brito,et al.  Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection , 1997 .