Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification

Motivation: Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures from mass spectrometry (MS) data. Recently, there has been increasing interest in using other information sources, such as liquid chromatography (LC) retention time (RT), to improve MS-based identifications.

Results: We put forward a probabilistic modelling framework to integrate the MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking potential molecular structures. Our experiments show that combining tandem mass spectrometry (MS2) data and retention orders with our approach improves identification accuracy and outperforms state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of the LC-MS features have MS2 measurements available in addition to MS1.

Availability and implementation: Software and data are freely available at https://github.com/aalto-ics-kepaco/msms_rt_score_integration.

Contact: eric.bach@aalto.fi
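To make the modelling idea in the Results concrete, the following is a minimal sketch of a Markov random field over LC-MS features, not the authors' implementation. It assumes that each feature has a short candidate list, that each candidate carries an MS-based score (node potential) and a scalar predicted retention "preference value", and that the edge potential is a sigmoid of how well the predicted order of two candidates agrees with the observed RT order. The function names, the sigmoid form, the parameter `k` and the toy numbers are all illustrative assumptions. Exhaustive enumeration stands in for the efficient approximate inference mentioned above, so the sketch is only feasible for tiny examples.

```python
import itertools
import math


def edge_potential(rt_i, rt_j, pref_i, pref_j, k=1.0):
    """Assumed edge potential: sigmoid agreement between the observed RT
    order of two features and the predicted order of their candidates."""
    sign = 1.0 if rt_j > rt_i else (-1.0 if rt_j < rt_i else 0.0)
    return 1.0 / (1.0 + math.exp(-k * sign * (pref_j - pref_i)))


def max_marginals(rts, ms_scores, prefs):
    """Brute-force max-marginals: for every candidate, the best joint score
    over all assignments in which that candidate is selected."""
    n = len(rts)
    best = [[0.0] * len(scores) for scores in ms_scores]
    for z in itertools.product(*(range(len(s)) for s in ms_scores)):
        score = 1.0
        for i in range(n):  # node potentials (MS-based scores)
            score *= ms_scores[i][z[i]]
        for i, j in itertools.combinations(range(n), 2):  # all feature pairs
            score *= edge_potential(rts[i], rts[j],
                                    prefs[i][z[i]], prefs[j][z[j]])
        for i in range(n):
            best[i][z[i]] = max(best[i][z[i]], score)
    return best


def rank(marginals):
    """Candidate indices per feature, sorted by decreasing max-marginal."""
    return [sorted(range(len(m)), key=lambda r: -m[r]) for m in marginals]


# Toy data (hypothetical): three features ordered by observed RT.
rts = [1.0, 2.0, 3.0]
ms_scores = [[0.55, 0.45], [1.0], [1.0]]  # MS alone prefers candidate 0
prefs = [[5.0, 0.0], [2.0], [4.0]]        # candidate 0 is predicted to elute too late

marginals = max_marginals(rts, ms_scores, prefs)
ranking = rank(marginals)
print(ranking[0])
```

In this toy setting the MS score alone would rank candidate 0 of the first feature on top, while the retention order evidence from the two later-eluting features moves candidate 1 ahead of it. For realistic numbers of features and candidates the enumeration grows exponentially, which is why an approximate inference scheme is needed in practice.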
