Exploring Chemical Space using Natural Language Processing Methodologies for Drug Discovery

Text-based representations of chemicals and proteins can be thought of as unstructured languages codified by humans to describe domain-specific knowledge. Advances in natural language processing (NLP) for spoken languages have accelerated its application to these biochemical representations: NLP methods are used to uncover knowledge hidden in the text and to build models that predict molecular properties or design novel molecules. This review outlines the impact of these advances on drug discovery and aims to further the dialogue between medicinal chemists and computer scientists.
