Metabolite identification and molecular fingerprint prediction through machine learning

MOTIVATION: Metabolite identification from tandem mass spectra is an important problem in metabolomics, underpinning subsequent metabolic modelling and network analysis. Currently, however, this task requires matching the observed spectrum against a database of reference spectra acquired on similar equipment with closely matching operating parameters, a condition that is rarely satisfied in public repositories. Furthermore, computational support for identifying molecules not present in reference databases is lacking. Recent efforts in assembling large public mass spectral databases such as MassBank have opened the door to a new class of metabolite identification methods.

RESULTS: We introduce a novel framework for predicting molecular characteristics and identifying metabolites from tandem mass spectra using machine learning with the support vector machine. Our approach is first to predict a large set of molecular properties of the unknown metabolite from salient tandem mass spectral signals, and then to match the predicted properties against large molecule databases such as PubChem. We demonstrate that several molecular properties can be predicted with high accuracy and that they are useful in de novo metabolite identification, where the reference database does not contain any spectra of the same molecule.

AVAILABILITY: A Matlab/Python package of the FingerID tool is freely available on the web at http://www.sourceforge.net/p/fingerid.

CONTACT: markus.heinonen@cs.helsinki.fi
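
The second step described above, matching predicted properties against a molecule database, can be illustrated with a minimal sketch. It is not the FingerID implementation. It assumes that the first step (one classifier per molecular property, not shown here) has already produced a binary fingerprint, and that each property has an estimated prediction accuracy strictly between 0 and 1. The scoring rule, the names `score` and `rank_candidates`, and the toy candidates `mol_A` to `mol_C` are all hypothetical choices made for illustration.

```python
from math import log


def score(predicted, candidate, accuracy):
    """Log-likelihood-style score of a candidate fingerprint.

    Assumption: a property predicted with accuracy `a` contributes log(a)
    when the candidate agrees with the prediction and log(1 - a) otherwise.
    Accuracies must lie strictly between 0 and 1.
    """
    total = 0.0
    for p, c, a in zip(predicted, candidate, accuracy):
        total += log(a if p == c else 1.0 - a)
    return total


def rank_candidates(predicted, candidates, accuracy):
    """Return candidate names ordered from best to worst score."""
    scored = [(score(predicted, fp, accuracy), name)
              for name, fp in candidates.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored]


# Toy data: four hypothetical molecular properties.
predicted = [1, 0, 1, 1]            # output of the (omitted) prediction step
accuracy = [0.9, 0.8, 0.6, 0.95]    # assumed per-property accuracies
candidates = {                      # stand-in for database fingerprints
    "mol_A": [1, 0, 1, 1],
    "mol_B": [1, 0, 0, 1],
    "mol_C": [0, 1, 1, 0],
}

ranking = rank_candidates(predicted, candidates, accuracy)
print(ranking)
```

Weighting each property by its accuracy means that a mismatch on a reliably predicted property penalizes a candidate more than a mismatch on a poorly predicted one, which is one plausible way to tolerate errors in the predicted fingerprint.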
