Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction

Learning an expressive molecular representation is central to developing quantitative structure-activity and structure-property relationships. Traditional approaches rely on group additivity rules, empirical measurements or parameters, or the generation of thousands of descriptors. In this paper, we employ a convolutional neural network for this embedding task by treating molecules as undirected graphs with attributed nodes and edges. Simple atom and bond attributes are used to construct atom-specific feature vectors that account for the local chemical environment at different neighborhood radii. Working directly with the full molecular graph gives models a greater opportunity to identify features relevant to a prediction task. Unlike other graph-based approaches, our atom featurization preserves molecule-level spatial information that significantly enhances model performance. Our models learn to identify important features of atom clusters for the prediction of aqueous solubility, octanol solubility, melting point, and toxicity. Extensions and limitations of this strategy are discussed.
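As a rough illustration of the idea of atom-specific feature vectors built from atom and bond attributes at increasing neighborhood radii, the sketch below aggregates attributes over an undirected graph in plain Python. It is not the model described in this paper: the function name `neighborhood_features`, the two-element atom and bond attribute encodings, and the unweighted summation over neighbors are all assumptions chosen for brevity, and the learned weights and nonlinearities of a convolutional network are omitted.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def add(u: Vector, v: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def neighborhood_features(
    atom_attrs: List[Vector],
    bond_attrs: Dict[Tuple[int, int], Vector],
    max_radius: int,
) -> List[List[Vector]]:
    """Return per-atom vectors for radius 0..max_radius (hypothetical helper).

    Radius 0 concatenates an atom's own attributes with the sum of its
    incident bond attributes. Each further radius adds the previous-radius
    vectors of the atom's direct neighbors, so information spreads one bond
    further per step.
    """
    n = len(atom_attrs)

    # Build an undirected adjacency list: each bond is stored in both directions.
    neighbors: Dict[int, List[Tuple[int, Vector]]] = {i: [] for i in range(n)}
    for (i, j), attrs in bond_attrs.items():
        neighbors[i].append((j, attrs))
        neighbors[j].append((i, attrs))

    bond_dim = len(next(iter(bond_attrs.values()), []))

    # Radius 0: own atom attributes plus summed incident bond attributes.
    current: List[Vector] = []
    for i in range(n):
        bond_sum = [0.0] * bond_dim
        for _, attrs in neighbors[i]:
            bond_sum = add(bond_sum, attrs)
        current.append(list(atom_attrs[i]) + bond_sum)

    layers = [current]
    for _ in range(max_radius):
        updated: List[Vector] = []
        for i in range(n):
            vec = list(current[i])
            for j, _ in neighbors[i]:
                vec = add(vec, current[j])
            updated.append(vec)
        layers.append(updated)
        current = updated
    return layers


# Heavy-atom graph of ethanol (C-C-O); the encodings are illustrative assumptions.
# Atom attributes: [is_carbon, is_oxygen]; bond attributes: [is_single, is_double].
ethanol_atoms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
ethanol_bonds = {(0, 1): [1.0, 0.0], (1, 2): [1.0, 0.0]}

layers = neighborhood_features(ethanol_atoms, ethanol_bonds, max_radius=1)
for radius, atom_vectors in enumerate(layers):
    print(radius, atom_vectors)
```

In this toy example the central carbon's radius-1 vector combines contributions from both of its neighbors, which is the sense in which a larger radius captures a wider local chemical environment. A trained model would replace the plain sums with learned transformations.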
