GEN: highly efficient SMILES explorer using autodidactic generative examination networks

Recurrent neural networks have been widely used to generate millions of de novo molecules in a known chemical space. These deep generative models are typically set up with LSTM or GRU units and trained with canonical SMILES. In this study, we introduce a new robust architecture, Generative Examination Networks (GENs), based on bidirectional RNNs with concatenated sub-models, to learn and generate molecular SMILES within a trained target space. GENs autonomously learn the target space in a few epochs while being subjected to an independent online examination mechanism that measures the quality of the generated set. Here we used online statistical quality control (SQC) on the percentage of valid molecular SMILES as the examination measure to select the earliest available stable model weights. Very high levels of valid SMILES (95-98%) can be generated by using multiple parallel encoding layers in combination with SMILES augmentation based on unrestricted SMILES randomization. Our architecture combines an excellent novelty rate (85-90%) with a strong conservation of the property space (95-99%) in the generated SMILES. Our flexible examination mechanism is open to other quality criteria.
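The examination step described above can be illustrated with a minimal sketch. It assumes that the percentage of valid SMILES has already been measured on a generated sample after each epoch, and it uses a simple moving-window rule as a stand-in for the SQC criterion, whose exact form is not given here. The function name `earliest_stable_epoch` and the parameters `window`, `min_mean`, and `max_std` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def earliest_stable_epoch(
    validity: Sequence[float],
    window: int = 4,
    min_mean: float = 95.0,
    max_std: float = 1.0,
) -> Optional[int]:
    """Return the 0-based index of the first epoch that closes a stable window.

    A window counts as stable when the mean validity (in percent) of its
    epochs is at least `min_mean` and their population standard deviation
    is at most `max_std`. Both thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    # Slide a fixed-size window over the per-epoch validity history.
    for end in range(window, len(validity) + 1):
        recent = validity[end - window:end]
        if mean(recent) >= min_mean and pstdev(recent) <= max_std:
            return end - 1  # last epoch of the first stable window
    return None  # no stable window found yet; keep training


# Hypothetical per-epoch validity percentages, for demonstration only.
history = [41.0, 78.5, 90.2, 95.8, 96.4, 96.1, 96.9, 97.0]
print(earliest_stable_epoch(history))  # prints 6
```

In a training loop, the weights saved at the returned epoch would be the ones kept. Because the rule only consumes a sequence of numbers, any other quality criterion measured per epoch could be substituted for validity.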
