ZODIAC: database-independent molecular formula annotation using Gibbs sampling reveals unknown small molecules

The confident high-throughput identification of small molecules remains one of the most challenging tasks in mass spectrometry-based metabolomics. SIRIUS has become a powerful tool for interpreting tandem mass spectra and shows outstanding performance in identifying the molecular formula of a query compound, which is the first step of structure identification. Nevertheless, identifying molecular formulas remains highly challenging, both for large compounds above 500 Daltons and for novel molecular formulas. Here, we present ZODIAC, a network-based algorithm for the de novo estimation of molecular formulas. ZODIAC reranks SIRIUS’ molecular formula candidates, combining fragmentation tree computation with Bayesian statistics using Gibbs sampling. Through careful algorithm engineering, ZODIAC’s Gibbs sampling is fast in practice. ZODIAC decreases incorrect annotations 16.2-fold on a challenging plant extract dataset in which most compounds are above 700 Daltons; we then show improvements on four additional, diverse datasets. Our analysis led to the discovery of compounds with novel molecular formulas such as C24H47BrNO8P, which, as of today, is not present in any publicly available molecular structure database.
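To illustrate the general idea of reranking formula candidates on a network with Gibbs sampling, the following is a minimal toy sketch and not ZODIAC's actual model or code. It assumes that each compound comes with a list of candidate formulas and prior probabilities (standing in for per-compound scores), and that pairs of candidates from different compounds may share a non-negative edge weight (standing in for similarity between candidates). The function name `gibbs_rerank`, the `edges` data structure, the conditional distribution, and all numbers and formulas in the example are hypothetical.

```python
import math
import random
from collections import Counter


def gibbs_rerank(priors, edges, n_sweeps=2000, burn_in=500, seed=0):
    """Toy Gibbs sampler for reranking candidates (illustrative assumption only).

    priors: {compound: {formula: prior probability}}
    edges:  {frozenset({(compound_a, formula_a), (compound_b, formula_b)}): weight}
    Returns {compound: {formula: estimated marginal probability}}.
    """
    rng = random.Random(seed)
    compounds = list(priors)
    # Start from the candidate with the highest prior for every compound.
    state = {c: max(priors[c], key=priors[c].get) for c in compounds}
    counts = {c: Counter() for c in compounds}

    for sweep in range(n_sweeps):
        for c in compounds:
            formulas = list(priors[c])
            weights = []
            for f in formulas:
                # Support from the candidates currently selected for all other compounds.
                support = sum(
                    edges.get(frozenset({(c, f), (o, state[o])}), 0.0)
                    for o in compounds
                    if o != c
                )
                weights.append(priors[c][f] * math.exp(support))
            # Resample this compound's candidate from its conditional distribution.
            state[c] = rng.choices(formulas, weights=weights, k=1)[0]
        if sweep >= burn_in:
            for c in compounds:
                counts[c][state[c]] += 1

    kept = n_sweeps - burn_in
    return {c: {f: counts[c][f] / kept for f in priors[c]} for c in compounds}


# Hypothetical two-compound example: the lower-ranked candidate of compound "a"
# is supported by the top candidate of compound "b" through one edge.
priors = {
    "a": {"C6H12O6": 0.4, "C7H16O3S": 0.6},
    "b": {"C6H10O5": 0.9, "C7H14O2S": 0.1},
}
edges = {frozenset({("a", "C6H12O6"), ("b", "C6H10O5")}): 2.0}

marginals = gibbs_rerank(priors, edges)
for compound, ranking in marginals.items():
    print(compound, sorted(ranking.items(), key=lambda kv: -kv[1]))
```

In this toy setting, a candidate that is not ranked first by its prior can still end up with the highest estimated marginal when it is well connected to likely candidates of other compounds, which is the intuition behind reranking on a network.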
