Deep learning to generate in silico chemical property libraries and candidate molecules for small molecule identification in complex samples.

Comprehensive and unambiguous identification of small molecules in complex samples will revolutionize our understanding of the role of metabolites in biological systems. Existing and emerging technologies can measure chemical properties of molecules in complex mixtures and, used in concert, are sensitive enough to resolve even stereoisomers. Despite these experimental advances, small molecule identification is hindered by (i) chemical reference libraries (e.g., mass spectra, collision cross section, and other measurable property libraries) that represent <1% of known molecules, limiting the number of possible identifications, and (ii) the lack of a method to generate candidate matches directly from experimental features (i.e., without a library). To this end, we developed a variational autoencoder (VAE) that learns a continuous numerical, or latent, representation of molecular structure to expand reference libraries for small molecule identification. We extended the VAE with a chemical property decoder, trained as a multitask network, to shape the latent representation so that it organizes according to desired chemical properties. The approach is unique in its application to metabolomics and small molecule identification: it focuses on properties that can be obtained from experimental measurements (mass-to-charge ratio, m/z, and collision cross section, CCS) and uses a training paradigm built on a cascade of transfer learning iterations. First, a molecular representation is learned from a large dataset of structures with m/z labels. Next, training continues with in silico property values, because experimental property data are limited. Finally, the network is refined by training on the experimental data. This allows the network to learn as much as possible at each stage, enabling success with progressively smaller datasets without overfitting.
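As an illustration of the multitask objective described above, the following is a minimal NumPy sketch of a per-batch loss that combines structure reconstruction, the standard VAE Kullback-Leibler term, and a property regression term. It is not the authors' implementation: the function name, the weights `kl_weight` and `prop_weight`, the array shapes, and the convention of marking missing property labels with NaN (for example, molecules that carry an m/z label but no CCS value) are all assumptions made for this example.

```python
import numpy as np


def vae_multitask_loss(x_onehot, x_probs, mu, log_var, y_true, y_pred,
                       kl_weight=1.0, prop_weight=1.0):
    """Mean loss over a batch: reconstruction + KL + property regression.

    Assumed (hypothetical) shapes:
      x_onehot, x_probs : (batch, seq_len, vocab)  one-hot input / decoder output
      mu, log_var       : (batch, latent_dim)      encoder outputs
      y_true, y_pred    : (batch, n_properties)    e.g. columns for m/z and CCS
    """
    eps = 1e-9
    # Categorical cross-entropy between input tokens and reconstructed tokens.
    recon = -np.sum(x_onehot * np.log(x_probs + eps), axis=(1, 2))
    # KL divergence between N(mu, sigma^2) and the unit Gaussian prior.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    # Squared error on property labels; NaN labels are masked out (assumption).
    mask = ~np.isnan(y_true)
    sq_err = np.where(mask, (y_pred - np.where(mask, y_true, 0.0)) ** 2, 0.0)
    prop = sq_err.sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
    return float(np.mean(recon + kl_weight * kl + prop_weight * prop))
```

Under this sketch, the three-stage cascade would reuse the same loss while changing only the dataset and the available labels at each stage, carrying the weights forward from one stage to the next.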
Once trained, the network can be used to predict chemical properties directly from structure and to generate candidate structures with desired chemical properties. Our approach is orders of magnitude faster than first-principles simulation for CCS property prediction. Additionally, the ability to generate novel molecules along manifolds, defined by chemical property analogues, positions the network, DarkChem, as highly useful in a number of application areas, including metabolomics and small molecule identification, drug discovery and design, chemical forensics, and beyond.
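To make the role of an expanded in silico library concrete, the sketch below shows one way predicted (m/z, CCS) values could be matched against an experimental feature. It is a hedged illustration rather than a method taken from this work: `LibraryEntry`, `match_feature`, and the default tolerances (10 ppm for m/z, 3% for CCS) are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryEntry:
    """One in silico library record (hypothetical structure)."""
    name: str
    mz: float   # predicted mass-to-charge ratio
    ccs: float  # predicted collision cross section


def match_feature(mz, ccs, library, mz_tol_ppm=10.0, ccs_tol_pct=3.0):
    """Return library entries within both tolerances, closest CCS first."""
    hits = []
    for entry in library:
        ppm_err = abs(entry.mz - mz) / mz * 1e6
        pct_err = abs(entry.ccs - ccs) / ccs * 100.0
        if ppm_err <= mz_tol_ppm and pct_err <= ccs_tol_pct:
            hits.append((pct_err, ppm_err, entry))
    # Rank by CCS error, breaking ties by m/z error.
    hits.sort(key=lambda h: (h[0], h[1]))
    return [entry for _, _, entry in hits]
```

In practice the tolerances would be set from the measurement precision of the instrument and the prediction error of the model, neither of which is specified here.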
