SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties

Chemical databases store information in text representations, and the SMILES format is a universal standard used in many cheminformatics software. Encoded in each SMILES string is structural information that can be used to predict complex chemical properties. In this work, we develop SMILES2vec, a deep RNN that automatically learns features from SMILES to predict chemical properties, without the need for additional explicit feature engineering. Using Bayesian optimization methods to tune the network architecture, we show that an optimized SMILES2vec model can serve as a general-purpose neural network for predicting distinct chemical properties including toxicity, activity, solubility and solvation energy, while also outperforming contemporary MLP neural networks that uses engineered features. Furthermore, we demonstrate proof-of-concept of interpretability by developing an explanation mask that localizes on the most important characters used in making a prediction. When tested on the solubility dataset, it identified specific parts of a chemical that is consistent with established first-principles knowledge with an accuracy of 88%. Our work demonstrates that neural networks can learn technically accurate chemical concept and provide state-of-the-art accuracy, making interpretable deep neural networks a useful tool of relevance to the chemical industry.

[1]  J. P. Grossman,et al.  Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[2]  Esben Jannik Bjerrum,et al.  SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules , 2017, ArXiv.

[3]  David Rogers,et al.  Extended-Connectivity Fingerprints , 2010, J. Chem. Inf. Model..

[4]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[5]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[6]  Vijay S. Pande,et al.  Massively Multitask Networks for Drug Discovery , 2015, ArXiv.

[7]  Sepp Hochreiter,et al.  Self-Normalizing Neural Networks , 2017, NIPS.

[8]  Alexandre Tkatchenko,et al.  Quantum-chemical insights from deep tensor neural networks , 2016, Nature Communications.

[9]  Klaus Schulten,et al.  GPU-accelerated molecular modeling coming of age. , 2010, Journal of molecular graphics & modelling.

[10]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11]  Günter Klambauer,et al.  DeepTox: Toxicity Prediction using Deep Learning , 2016, Front. Environ. Sci..

[12]  Vijay S. Pande,et al.  Molecular graph convolutions: moving beyond fingerprints , 2016, Journal of Computer-Aided Molecular Design.

[13]  Vijay S. Pande,et al.  MoleculeNet: a benchmark for molecular machine learning , 2017, Chemical science.

[14]  David L Mobley,et al.  Alchemical free energy methods for drug discovery: progress and challenges. , 2011, Current opinion in structural biology.

[15]  Roberto Todeschini,et al.  Handbook of Molecular Descriptors , 2002 .

[16]  S. Joshua Swamidass,et al.  Modeling Reactivity to Biological Macromolecules with a Deep Multitask Network , 2016, ACS central science.

[17]  J S Smith,et al.  ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost , 2016, Chemical science.

[18]  Navdeep Jaitly,et al.  Multi-task Neural Networks for QSAR Predictions , 2014, ArXiv.

[19]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[20]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[22]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[23]  Gisbert Schneider,et al.  Deep Learning in Drug Discovery , 2016, Molecular informatics.

[24]  Li Di,et al.  Bridging solubility between drug discovery and development. , 2012, Drug discovery today.

[25]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[26]  Abhinav Vishnu,et al.  Adaptive neuron apoptosis for accelerating deep learning on large scale systems , 2016, 2016 IEEE International Conference on Big Data (Big Data).

[27]  George Kurian,et al.  Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[28]  Pierre Baldi,et al.  Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules , 2013, J. Chem. Inf. Model..

[29]  M. Rupp,et al.  Machine learning of molecular electronic properties in chemical compound space , 2013, 1305.7074.

[30]  Quoc V. Le,et al.  Neural Architecture Search with Reinforcement Learning , 2016, ICLR.

[31]  Naomi L Kruhlak,et al.  Progress in QSAR toxicity screening of pharmaceutical impurities and other FDA regulated products. , 2007, Advanced drug delivery reviews.

[32]  Abhinav Vishnu,et al.  How Much Chemistry Does a Deep Neural Network Need to Know to Make Accurate Predictions? , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[33]  Abhinav Vishnu,et al.  Deep learning for computational chemistry , 2017, J. Comput. Chem..

[34]  David Weininger,et al.  SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules , 1988, J. Chem. Inf. Comput. Sci..

[35]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  J. Dearden,et al.  QSAR modeling: where have you been? Where are you going to? , 2014, Journal of medicinal chemistry.

[37]  Ian Dewancker,et al.  Evaluation System for a Bayesian Optimization Service , 2016, ArXiv.

[38]  Wojciech Czarnecki,et al.  Learning to SMILE(S) , 2016, ArXiv.

[39]  Alán Aspuru-Guzik,et al.  Convolutional Networks on Graphs for Learning Molecular Fingerprints , 2015, NIPS.

[40]  Abhinav Vishnu,et al.  ChemNet: A Transferable and Generalizable Deep Neural Network for Small-Molecule Property Prediction , 2017, ArXiv.

[41]  W. L. Jorgensen The Many Roles of Computation in Drug Discovery , 2004, Science.

[42]  Abhinav Vishnu,et al.  Chemception: A Deep Neural Network with Minimal Chemistry Knowledge Matches the Performance of Expert-developed QSAR/QSPR Models , 2017, ArXiv.

[43]  John R. Platt,et al.  Influence of Neighbor Bonds on Additive Bond Properties in Paraffins , 1947 .

[44]  John P. Overington,et al.  ChEMBL: a large-scale bioactivity database for drug discovery , 2011, Nucleic Acids Res..

[45]  Thomas Unterthiner,et al.  Multi-Task Deep Networks for Drug Target Prediction , 2015 .

[46]  Izhar Wallach,et al.  AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery , 2015, ArXiv.

[47]  Nicholas Bodor,et al.  Computer-aided drug design: the role of quantitative structure-property, structure-activity and structure-metabolism relationships (QSPR, QSAR, QSMR) , 2002 .