Probabilistic Framework for Integration of Mass Spectrum and Retention Time Information in Small Molecule Identification.

MOTIVATION: Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures from mass spectrometry (MS) data. Recently, there has been increasing interest in using other information sources, such as liquid chromatography (LC) retention time (RT), to improve identifications that are based solely on MS information, such as precursor mass-per-charge and tandem mass spectra (MS2).

RESULTS: We put forward a probabilistic modelling framework to integrate the MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking candidate molecular structures. Our experiments show that combining MS2 data and retention orders with our approach improves identification accuracy and outperforms state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of the LC-MS features have MS2 measurements available in addition to MS1.

AVAILABILITY AND IMPLEMENTATION: Software and data are freely available at https://github.com/aalto-ics-kepaco/msms_rt_score_integration.
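To make the idea of scoring candidates with a Markov random field concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each LC-MS feature has a list of candidate structures with a positive MS-based score and a predicted retention-order score (for example, from a ranking model). It simplifies the model in two ways that are assumptions of this example: only features adjacent in elution order are connected (a single chain instead of all pairs), and exact max-product message passing on that chain replaces the approximate inference used in the paper. The function name `chain_max_marginals`, the sigmoid edge potential, and the `weight` parameter are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def chain_max_marginals(ms_scores, order_scores, rts, weight=0.5):
    """Max-marginal log-scores for every candidate of every feature.

    ms_scores[i][r]    : positive MS-based score of candidate r of feature i
    order_scores[i][r] : predicted retention-order score of that candidate
    rts[i]             : observed retention time of feature i
    weight             : trade-off between MS and retention-order information
    (All names and the potential functions are assumptions of this sketch.)
    """
    idx = sorted(range(len(rts)), key=lambda i: rts[i])  # chain in elution order
    node = [[(1 - weight) * math.log(s) for s in ms_scores[i]] for i in idx]
    pred = [order_scores[i] for i in idx]
    n = len(idx)

    def edge(a, r, s):
        # Feature a elutes before feature a + 1, so candidate pairs whose
        # predicted order agrees (pred[a + 1][s] > pred[a][r]) score higher.
        return weight * log_sigmoid(pred[a + 1][s] - pred[a][r])

    # Forward and backward max-product messages in the log domain.
    fwd = [[0.0] * len(node[a]) for a in range(n)]
    bwd = [[0.0] * len(node[a]) for a in range(n)]
    for a in range(1, n):
        for s in range(len(node[a])):
            fwd[a][s] = max(fwd[a - 1][r] + node[a - 1][r] + edge(a - 1, r, s)
                            for r in range(len(node[a - 1])))
    for a in range(n - 2, -1, -1):
        for r in range(len(node[a])):
            bwd[a][r] = max(bwd[a + 1][s] + node[a + 1][s] + edge(a, r, s)
                            for s in range(len(node[a + 1])))

    # Max-marginal = best joint log-score with this candidate held fixed.
    result = [None] * n
    for a, i in enumerate(idx):
        result[i] = [fwd[a][r] + node[a][r] + bwd[a][r]
                     for r in range(len(node[a]))]
    return result


# Feature 0 has two candidates that are tied on MS score; feature 1 elutes
# later and has a single candidate. Retention order breaks the tie.
mm = chain_max_marginals(
    ms_scores=[[0.5, 0.5], [1.0]],
    order_scores=[[0.0, 5.0], [2.0]],
    rts=[1.0, 2.0],
)
ranking_feature_0 = sorted(range(2), key=lambda r: -mm[0][r])
```

In this toy input the second feature elutes later, so the candidate of the first feature with the lower predicted retention-order score ends up ranked first, even though the MS scores alone cannot separate the two candidates.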
