Searching molecular structure databases with tandem mass spectra using CSI:FingerID

Significance

Untargeted metabolomics experiments usually rely on tandem MS (MS/MS) to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. Recently, several computational approaches were presented for searching molecular structure databases using MS/MS data. Here, we present CSI:FingerID, which combines fragmentation tree computation and machine learning. An in-depth evaluation on two large-scale datasets shows that our method can find 150% more correct identifications than the second-best search method. In comparison with the two runner-up methods, CSI:FingerID reaches 5.4-fold more unique identifications. We also present evaluations indicating that the performance of our method will further improve when more training data become available. CSI:FingerID is publicly available at www.csi-fingerid.org.

Abstract

Metabolites provide a direct functional signature of cellular state. Untargeted metabolomics experiments usually rely on tandem MS to identify the thousands of compounds in a biological sample. Today, the vast majority of metabolites remain unknown. We present a method for searching molecular structure databases using tandem MS data of small molecules. Our method computes a fragmentation tree that best explains the fragmentation spectrum of an unknown molecule. We use the fragmentation tree to predict the molecular structure fingerprint of the unknown compound using machine learning. This fingerprint is then used to search a molecular structure database such as PubChem. Our method is shown to improve on the competing methods for computational metabolite identification by a considerable margin.
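The final step described above, searching a structure database with a predicted fingerprint, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the 0.5 threshold, the six-bit toy fingerprints, and the use of plain Tanimoto similarity as the score. It is not the scoring used by CSI:FingerID, and it leaves out fragmentation tree computation and the machine-learning predictor entirely.

```python
from typing import Dict, List, Tuple


def tanimoto(a: List[int], b: List[int]) -> float:
    """Tanimoto (Jaccard) similarity between two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero fingerprints are treated as identical.
    return both / either if either else 1.0


def rank_candidates(
    predicted_probs: List[float],
    candidates: Dict[str, List[int]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[str, float]]:
    """Binarize the predicted fingerprint and rank candidates by similarity to it."""
    predicted = [1 if p >= threshold else 0 for p in predicted_probs]
    scored = [(name, tanimoto(predicted, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy input: per-bit probabilities standing in for the output of a trained predictor.
predicted_probs = [0.9, 0.8, 0.1, 0.7, 0.2, 0.05]

# Toy "database": made-up candidate structures with made-up fingerprints.
candidates = {
    "candidate_A": [1, 1, 0, 1, 0, 0],
    "candidate_B": [1, 0, 1, 1, 0, 0],
    "candidate_C": [0, 0, 1, 0, 1, 1],
}

ranking = rank_candidates(predicted_probs, candidates)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

In a real search, the candidates would be structures retrieved from a database such as PubChem, typically restricted to those matching the molecular formula of the unknown, and the fingerprint of each candidate would be computed from its structure rather than written by hand.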
