Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules

We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.
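The latent-space operations named above (perturbing a known point, interpolating between two points, and following a property gradient) can be illustrated with a minimal sketch. This is not the trained model: the encoder and decoder are omitted, the latent vectors are plain NumPy arrays, and `toy_predictor` is a hypothetical quadratic stand-in for a learned property predictor. The function names, the use of linear interpolation, the step size, and the number of steps are all assumptions made for illustration.

```python
import numpy as np


def interpolate(z_start, z_end, steps):
    """Return `steps` latent points evenly spaced on the straight line
    from z_start to z_end (linear interpolation is an assumption here)."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.array([(1.0 - t) * z_start + t * z_end for t in ts])


def perturb(z, scale, rng):
    """Add isotropic Gaussian noise of standard deviation `scale` to z."""
    return z + scale * rng.standard_normal(z.shape)


def toy_predictor(z, target):
    """Hypothetical stand-in for a trained predictor: the predicted
    property is highest when z equals `target`."""
    return -np.sum((z - target) ** 2)


def toy_predictor_grad(z, target):
    """Analytic gradient of toy_predictor with respect to z."""
    return -2.0 * (z - target)


def gradient_ascent(z0, target, lr=0.1, n_steps=50):
    """Move a latent vector uphill on the toy predictor."""
    z = z0.copy()
    for _ in range(n_steps):
        z = z + lr * toy_predictor_grad(z, target)
    return z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_a = np.zeros(4)  # pretend encoding of molecule A
    z_b = np.ones(4)   # pretend encoding of molecule B

    path = interpolate(z_a, z_b, steps=5)
    neighbour = perturb(z_a, scale=0.1, rng=rng)
    z_opt = gradient_ascent(z_a, target=z_b)

    print(path.shape, neighbour.shape)
    print(toy_predictor(z_a, z_b), toy_predictor(z_opt, z_b))
```

In the full method, each vector produced this way would be passed to the decoder to obtain a discrete molecular representation, and the gradient would come from the trained predictor rather than from a closed-form function.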
