Transformer-CNN: Swiss knife for QSAR modeling and interpretation

We present SMILES embeddings derived from the internal encoder state of a Transformer [25] model trained to canonicalize SMILES as a Seq2Seq problem. Using a character-level CNN (CharNN) [38] architecture on top of these embeddings yields higher-quality, interpretable QSAR/QSPR models on diverse benchmark datasets covering both regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, so each prediction is based on an internal consensus. Because both the augmentation and the transfer learning operate on the embeddings, the method also provides good results for small datasets. We discuss the reasons for this effectiveness and outline future directions for developing the method. The source code and the embeddings needed to train a QSAR model are available at https://github.com/bigchem/transformer-cnn . The repository also contains a standalone program for QSAR prediction that calculates the contributions of individual atoms, thereby interpreting the model's results. The OCHEM [9] environment ( https://ochem.eu ) hosts an online implementation of the proposed method.
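
As a rough illustration of the pipeline described above, the sketch below combines RDKit-based SMILES augmentation, per-character embeddings, a character-level convolutional head, and consensus averaging over the augmented SMILES. It is a minimal sketch under stated assumptions, not the repository's actual API: encode_smiles is a stand-in for the pretrained Transformer encoder (here it returns a random matrix), and the sequence length, embedding size, filter counts and kernel sizes are illustrative values rather than the paper's settings.

import numpy as np
import tensorflow as tf
from rdkit import Chem

MAX_LEN, EMB_DIM = 110, 64   # illustrative sizes, not the paper's exact settings

def encode_smiles(smiles):
    # Stand-in for the pretrained Transformer encoder: returns a
    # (MAX_LEN, EMB_DIM) matrix of per-character "embeddings". In the real
    # method this matrix would be the encoder's internal state for the SMILES.
    rng = np.random.default_rng(abs(hash(smiles)) % (2 ** 32))
    return rng.standard_normal((MAX_LEN, EMB_DIM)).astype("float32")

def build_charcnn_head():
    # Character-level CNN head: parallel 1-D convolutions with several kernel
    # sizes, max-pooled over the sequence and fed to dense layers.
    inp = tf.keras.Input(shape=(MAX_LEN, EMB_DIM))
    pooled = [
        tf.keras.layers.GlobalMaxPooling1D()(
            tf.keras.layers.Conv1D(filters=100, kernel_size=k, activation="relu")(inp)
        )
        for k in (1, 2, 3, 4, 5, 6)
    ]
    x = tf.keras.layers.Concatenate()(pooled)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(1)(x)   # regression head; use a sigmoid for classification
    return tf.keras.Model(inp, out)

def predict_with_consensus(model, smiles, n_aug=10):
    # SMILES augmentation at inference time: enumerate randomized (non-canonical)
    # SMILES of the same molecule and average the individual predictions.
    mol = Chem.MolFromSmiles(smiles)
    variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n_aug)}
    embeddings = np.stack([encode_smiles(s) for s in variants])
    return float(model.predict(embeddings, verbose=0).mean())

model = build_charcnn_head()                   # would be trained on augmented SMILES
print(predict_with_consensus(model, "CCO"))    # untrained model, illustrative output only

Averaging over randomized SMILES of the same molecule is what the abstract refers to as the internal consensus; the same augmentation is applied when training the CNN head.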

[1] Wojciech Samek, et al. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, 2019, Explainable AI.

[2] Geoffrey E. Hinton, et al. Reducing the Dimensionality of Data with Neural Networks, 2006, Science.

[3] Igor V. Tetko, et al. A renaissance of neural networks in drug discovery, 2016, Expert Opinion on Drug Discovery.

[4] Chris Morley, et al. Open Babel: An open chemical toolbox, 2011, J. Cheminformatics.

[5] Igor V. Tetko, et al. ToxCast EPA in Vitro to in Vivo Challenge: Insight into the Rank-I Model, 2016, Chemical Research in Toxicology.

[6] Igor V. Tetko, et al. PLS-Optimal: A Stepwise D-Optimal Design Based on Latent Variables, 2012, J. Chem. Inf. Model.

[7] Hiroshi Nakajima, et al. Janus kinase 3 (Jak3) is essential for common cytokine receptor γ chain (γc)-dependent signaling: comparative analysis of γc, Jak3, and γc and Jak3 double-deficient mice, 2000.

[8] Igor V. Tetko, et al. Focused Library Generator: case of Mdmx inhibitors, 2019, Journal of Computer-Aided Molecular Design.

[9] Igor V. Tetko, et al. Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information, 2011, J. Comput. Aided Mol. Des.

[10] Xiang Zhang, et al. Text Understanding from Scratch, 2015, arXiv.

[11] Igor V. Tetko, et al. Augmentation Is What You Need!, 2019, ICANN.

[12] Thierry Kogej, et al. Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks, 2017, arXiv.

[13] Léon Bottou, et al. Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics, 2014, EMNLP.

[14] Richard F. Gunst, et al. Applied Regression Analysis, 1999, Technometrics.

[15] Gregg D. Wilensky, et al. Neural Network Studies, 1993.

[16] Alán Aspuru-Guzik, et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules, 2016, ACS Central Science.

[17] Klaus-Robert Müller, et al. Towards Explainable Artificial Intelligence, 2019, Explainable AI.

[18] Yuedong Yang, et al. Identifying Structure-Property Relationships through SMILES Syntax Analysis with Self-Attention Mechanism, 2018, J. Chem. Inf. Model.

[19] Alán Aspuru-Guzik, et al. Convolutional Networks on Graphs for Learning Molecular Fingerprints, 2015, NIPS.

[20] Scott Boyer, et al. Choosing Feature Selection and Learning Algorithms in QSAR, 2014, J. Chem. Inf. Model.

[21] Igor V. Tetko, et al. Synergy Effect between Convolutional Neural Networks and the Multiplicity of SMILES for Improvement of Molecular Prediction, 2018, arXiv.

[22] David Vidal, et al. LINGO, an Efficient Holographic Text Based Method To Calculate Biophysical Properties and Intermolecular Similarities, 2005, J. Chem. Inf. Model.

[23] Jeffrey J. Sutherland, et al. Three-dimensional quantitative structure-activity and structure-selectivity relationships of dihydrofolate reductase inhibitors, 2004, J. Comput. Aided Mol. Des.

[24] Igor I. Baskin, et al. Chapter 1: Fragment Descriptors in SAR/QSAR/QSPR Studies, Molecular Similarity Analysis and in Virtual Screening, 2008.

[25] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[26] Samuel S. Schoenholz, et al. Neural Message Passing for Quantum Chemistry, 2017, ICML.

[27] Wojciech Czarnecki, et al. Learning to SMILE(S), 2016, arXiv.

[28] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[29] Igor V. Tetko, et al. Identifying potential endocrine disruptors among industrial chemicals and their metabolites: development and evaluation of in silico tools, 2015, Chemosphere.

[30] Frank Noé, et al. Learning Continuous and Data-Driven Molecular Descriptors by Translating Equivalent Chemical Representations, 2018.

[31] Jinfeng Yi, et al. Edge Attention-based Multi-Relational Graph Convolutional Networks, 2018, arXiv.

[32] Christopher A. Hunter, et al. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, 2018, ACS Central Science.

[33] Igor V. Tetko, et al. Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set, 2010, J. Chem. Inf. Model.

[34] Tianqi Chen, et al. XGBoost: A Scalable Tree Boosting System, 2016, KDD.

[35] Igor V. Tetko, et al. Estimation of Aqueous Solubility of Chemical Compounds Using E-State Indices, 2001, J. Chem. Inf. Comput. Sci.

[36] Hilde van der Togt, et al. Publisher's Note, 2003, J. Netw. Comput. Appl.

[37] Klaus-Robert Müller, et al. Layer-Wise Relevance Propagation: An Overview, 2019, Explainable AI.

[38] Xiang Zhang, et al. Character-level Convolutional Networks for Text Classification, 2015, NIPS.

[39] Igor V. Tetko, et al. Neural Network Modeling for Estimation of Partition Coefficient Based on Atom-Type Electrotopological State Indices, 2000, J. Chem. Inf. Comput. Sci.

[40] Igor V. Tetko, et al. Modeling the Biodegradability of Chemical Compounds Using the Online CHEmical Modeling Environment (OCHEM), 2013, Molecular Informatics.

[41] Abhinav Vishnu, et al. SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties, 2017, arXiv.

[42] David Weininger, et al. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, 1988, J. Chem. Inf. Comput. Sci.

[43] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[44] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[45] Alexander M. Rush, et al. The Annotated Transformer, 2018.

[46] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[47] Petra Schneider, et al. Generative Recurrent Networks for De Novo Drug Design, 2017, Molecular Informatics.

[48] Stephen R. Heller, et al. InChI - the worldwide chemical structure identifier standard, 2013, Journal of Cheminformatics.

[49] Chih-Jen Lin, et al. LIBSVM: A library for support vector machines, 2011, TIST.

[50] Igor V. Tetko, et al. Neural Network Studies, 2. Variable Selection, 1996, J. Chem. Inf. Comput. Sci.

[51] Sergey Sosnin, et al. Comparative Study of Multitask Toxicity Modeling on a Broad Chemical Space, 2018, J. Chem. Inf. Model.

[52] Masato Hagiwara, et al. Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 2018.

[53] Regina Barzilay, et al. Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction, 2017, J. Chem. Inf. Model.

[54] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[55] J. Huuskonen, et al. Estimation of Aqueous Solubility for a Diverse Set of Organic Compounds Based on Molecular Topology, 2010.

[56] Igor V. Tetko, et al. A Transformer Model for Retrosynthesis, 2019, ICANN.

[57] Abhinav Vishnu, et al. Using Rule-Based Labels for Weak Supervised Learning: A ChemNet for Transferable Chemical Property Prediction, 2017, KDD.

[58] John P. Overington, et al. ChEMBL: a large-scale bioactivity database for drug discovery, 2011, Nucleic Acids Res.

[59] Igor V. Tetko, et al. How Accurately Can We Predict the Melting Points of Drug-like Compounds?, 2014, J. Chem. Inf. Model.

[60] Joseph Gomes, et al. MoleculeNet: a benchmark for molecular machine learning, 2017, Chemical Science.

[61] Manuela Pavan, et al. DRAGON Software: An Easy Approach to Molecular Descriptor Calculations, 2006.

[62] Klaus-Robert Müller, et al. Explaining Recurrent Neural Network Predictions in Sentiment Analysis, 2017, WASSA@EMNLP.

[63] Igor V. Tetko, et al. Associative Neural Network, 2002, Neural Processing Letters.

[64] Thierry Kogej, et al. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks, 2017, ACS Central Science.

[65] Alja Plošnik, et al. Mutagenic and carcinogenic structural alerts and their mechanisms of action, 2016, Arhiv za higijenu rada i toksikologiju.