Computational methods for small molecule identification

Abstract: Identification of small molecules remains a central question in analytical chemistry, in particular for natural product research, metabolomics, environmental research, and biomarker discovery. Mass spectrometry is the predominant technique for high-throughput analysis of small molecules. However, it reveals only the mass of a molecule and, when tandem mass spectrometry is used, the masses of its fragments. Automated interpretation of mass spectra is often limited to searching in spectral libraries, so that we can only dereplicate molecules for which reference mass spectra have already been recorded. In this thesis, “Computational methods for small molecule identification”, we developed SIRIUS, a tool for the structural elucidation of small molecules with tandem mass spectrometry. In a first step, the method computes a hypothetical fragmentation tree using combinatorial optimization. Using a Bayesian statistical model, we learn the parameters and hyperparameters of the underlying scoring directly from data. We demonstrate that the statistical model, although fitted on a small dataset, generalizes well across many different datasets and mass spectrometry instruments. In a second step, the fragmentation tree is used to predict a molecular fingerprint using kernel support vector machines. The predicted fingerprint can then be searched in a structure database to identify the molecular structure. We demonstrate that our machine learning model outperforms all other methods for this task, including its predecessor FingerID. SIRIUS is available as a command-line tool and with a user interface. The molecular fingerprint prediction is implemented as a web service and receives over one million requests per month.
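The final step of the workflow, searching a predicted fingerprint in a structure database, can be illustrated with a minimal sketch. It assumes that the predictor returns one probability per fingerprint bit, that probabilities are thresholded at 0.5, and that candidates are ranked by Tanimoto similarity. These choices, the function names, and the toy data are illustrative assumptions and are not the scoring used by SIRIUS.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def rank_candidates(
    predicted_probs: List[float],
    candidates: Dict[str, Set[int]],
    threshold: float = 0.5,  # assumed cutoff for calling a bit 'on'
) -> List[Tuple[str, float]]:
    """Rank database candidates by similarity to the predicted fingerprint."""
    predicted_bits = {i for i, p in enumerate(predicted_probs) if p >= threshold}
    scored = [(name, tanimoto(predicted_bits, bits)) for name, bits in candidates.items()]
    # Highest similarity first
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): per-bit probabilities from a fingerprint predictor
predicted = [0.9, 0.1, 0.8, 0.7, 0.2]

# Toy structure database (hypothetical): candidate name -> set of 'on' bits
database = {
    "candidate_A": {0, 2, 3},
    "candidate_B": {0, 1},
    "candidate_C": {2, 3, 4},
}

ranking = rank_candidates(predicted, database)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy example the predicted fingerprint has bits 0, 2, and 3 set, so candidate_A matches exactly and is ranked first. A real search would restrict the candidates, for instance to structures with the molecular formula of the query, before scoring.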
