Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules

Computational methods have become indispensable tools for accelerating the drug discovery process and alleviating the excessive dependence on time-consuming and labor-intensive experiments. Traditional feature-engineering approaches rely heavily on expert knowledge to devise useful features, which can be costly and sometimes biased. Emerging deep learning (DL) methods offer a data-driven way to automatically learn expressive representations from complex raw data. Inspired by this, researchers have applied various deep neural network models to simplified molecular input line entry specification (SMILES) strings, which encode the full composition and structure information of molecules. However, current models usually suffer from a scarcity of labeled data, which limits the generalization ability of SMILES-based DL models and prevents them from competing with state-of-the-art computational methods. In this study, we used a bidirectional long short-term memory (BiLSTM) attention network (BAN) in which a novel multi-step attention mechanism facilitates the extraction of key features from SMILES strings. Meanwhile, SMILES enumeration was used as a data augmentation method in the training phase to substantially increase the amount of labeled data and raise the likelihood of mining more patterns from complex SMILES. We again took advantage of SMILES enumeration in the prediction phase to rectify model prediction bias and provide more accurate predictions. Combined with the BAN model, these strategies greatly improve the quality of the latent features learned from SMILES strings. On 11 canonical absorption, distribution, metabolism, excretion and toxicity (ADMET)-related tasks, our method outperformed state-of-the-art approaches.
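The two uses of SMILES enumeration described above (training-time augmentation and prediction-time averaging) can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the authors' implementation: it generates non-canonical SMILES via RDKit's atom renumbering, and `predict_one` is a hypothetical stand-in for the trained BAN model's per-string prediction function.

```python
# Minimal sketch of SMILES enumeration for data augmentation and
# prediction-time averaging. Assumes RDKit is installed; `predict_one`
# is a hypothetical callable standing in for the trained model.
import random
from rdkit import Chem


def enumerate_smiles(smiles, n_variants=10, seed=0):
    """Generate alternative (non-canonical) SMILES for the same molecule
    by randomly renumbering its atoms."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return [smiles]  # fall back to the input string if parsing fails
    rng = random.Random(seed)
    variants = set()
    for _ in range(n_variants):
        order = list(range(mol.GetNumAtoms()))
        rng.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return sorted(variants)


def augment_training_set(data, n_variants=10):
    """Expand each (SMILES, label) pair into several enumerated copies,
    enlarging the labeled training set."""
    return [(s, y) for smi, y in data for s in enumerate_smiles(smi, n_variants)]


def predict_with_enumeration(predict_one, smiles, n_variants=10):
    """Average predictions over enumerated SMILES of one molecule to reduce
    the bias of predicting from a single string form."""
    variants = enumerate_smiles(smiles, n_variants)
    return sum(predict_one(s) for s in variants) / len(variants)
```

Enumerating at prediction time mirrors the training-time augmentation: the model sees the same distribution of string variants in both phases, and averaging over variants smooths out the dependence of the prediction on any one arbitrary atom ordering.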
