Predicting Single-Substance Phase Diagrams: A Kernel Approach on Graph Representations of Molecules.

This work presents a Gaussian process regression (GPR) model built on a novel graph representation of chemical molecules that predicts thermodynamic properties of pure substances in single, double, and triple phases. A transferable molecular graph representation is proposed as the input to a marginalized graph kernel, which is the major component of the covariance function in our GPR models. Radial basis function kernels of temperature and pressure are also incorporated into the covariance function when necessary. We predict three representative properties of pure substances in single, double, and triple phases: critical temperature, vapor-liquid equilibrium (VLE) density, and pressure-temperature density. The data were collected from Knovel Data Analysis Beta: NIST ThermoDynamics Pure Compounds. The accuracy of the models is comparable to the precision of the experimental measurements. Moreover, the reliability of our predictions can be quantified on a per-sample basis using the posterior uncertainty of the GPR model. We compare our model against Morgan fingerprints and a graph neural network to further demonstrate the advantage of the proposed method. The marginalized graph kernel is computed using the GraphDot package at https://github.com/yhtang/GraphDot. All code used in this work can be found at https://github.com/Xiangyan93/Chem-Graph-Kernel-Machine.
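The covariance structure described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that the molecular, temperature, and pressure kernels are combined as a product, which the abstract does not state explicitly. The matrix `K_mol` is a hand-written stand-in for the marginalized graph kernel (it is not computed with GraphDot), and all function names, length scales, and data values are hypothetical and chosen only for illustration.

```python
import numpy as np

def rbf(a, b, length_scale):
    """Squared-exponential (RBF) kernel between two 1-D arrays."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def covariance(mol_a, T_a, P_a, mol_b, T_b, P_b, K_mol, ls_T, ls_P):
    """Assumed product form: k_graph(mol, mol') * k_rbf(T, T') * k_rbf(P, P')."""
    k_graph = K_mol[np.ix_(mol_a, mol_b)]  # look up molecule-molecule similarity
    return k_graph * rbf(T_a, T_b, ls_T) * rbf(P_a, P_b, ls_P)

def gpr_predict(K_train, K_cross, prior_var, y, noise=1e-4):
    """Standard GPR posterior mean and variance via a Cholesky factorization."""
    y_mean = y.mean()  # constant prior mean estimated from the targets
    L = np.linalg.cholesky(K_train + noise * np.eye(len(y)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - y_mean))
    mean = K_cross @ alpha + y_mean
    v = np.linalg.solve(L, K_cross.T)
    var = prior_var - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

# Stand-in similarity matrix for three molecules (made-up values, unit diagonal).
K_mol = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])

# Made-up training set: molecule index, temperature (K), pressure (bar), density.
mol_train = np.array([0, 0, 1, 1, 2])
T_train = np.array([300.0, 350.0, 300.0, 350.0, 300.0])
P_train = np.array([1.0, 1.0, 1.0, 5.0, 1.0])
y_train = np.array([0.70, 0.65, 0.72, 0.68, 0.80])

# Two queries: one identical to a training state, one far from all training data.
mol_test = np.array([0, 2])
T_test = np.array([300.0, 600.0])
P_test = np.array([1.0, 50.0])

ls_T, ls_P = 50.0, 5.0  # hypothetical length scales
K_train = covariance(mol_train, T_train, P_train,
                     mol_train, T_train, P_train, K_mol, ls_T, ls_P)
K_cross = covariance(mol_test, T_test, P_test,
                     mol_train, T_train, P_train, K_mol, ls_T, ls_P)
prior_var = np.ones(len(mol_test))  # k(x, x) = 1 for this normalized kernel

mean, var = gpr_predict(K_train, K_cross, prior_var, y_train)
for m, s in zip(mean, np.sqrt(var)):
    print(f"predicted density = {m:.3f}, posterior std = {s:.3f}")
```

In this toy setting the query that coincides with a training state receives a small posterior variance, while the query far from all training data falls back to the prior mean with a variance close to the prior. This is the per-sample reliability signal the abstract refers to.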
