A natural language processing approach based on embedding deep learning from heterogeneous compounds for quantitative structure–activity relationship modeling

Over the past decade, rapid development in biological and chemical technologies, such as high‐throughput screening and parallel synthesis, has significantly increased the amount of available data. Handling this data requires the creation and integration of new analytical methods, especially deep learning models. Interest in using deep learning for computer‐aided drug discovery has grown recently, owing to its successful application in many other fields. The present work proposes a natural language processing approach based on embedding deep neural networks. Our method transforms the Simplified Molecular Input Line Entry System (SMILES) format into word embedding vectors that represent the semantics of compounds. These vectors are fed into supervised machine learning algorithms, namely a convolutional long short‐term memory neural network, a support vector machine, and a random forest, to build quantitative structure–activity relationship models on toxicity data sets. The results obtained on toxicity data for the ciliate Tetrahymena pyriformis (IGC50) and on acute rat toxicity data expressed as the median lethal dose of treated rats (LD50) show that our approach can potentially be used to predict the activities of chemical compounds efficiently. All material used in this study is available online through the GitHub portal (https://github.com/BoukeliaAbdelbasset/NLPDeepQSAR.git).
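
As an illustration of the first step described above, the following minimal Python sketch shows one way a SMILES string could be split into tokens, mapped to integer indices, and looked up in an embedding table. It is not the authors' implementation: the tokenization rule, the function names (`tokenize`, `build_vocab`, `encode`, `embed`), the padding and unknown-token conventions, and the randomly initialized table are all assumptions made for this example. In a real model the embedding table would be learned during training rather than sampled at random.

```python
import random
import re

# Assumed tokenization rule (not taken from the paper): bracket atoms such as
# [NH4+] and the two-letter halogens Br and Cl are kept whole; every other
# character becomes its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

PAD, UNK = 0, 1  # reserved indices for padding and unseen tokens


def tokenize(smiles):
    """Split a SMILES string into a list of tokens ("words")."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Assign an integer index to every token seen in the corpus."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for smiles in smiles_list:
        for token in tokenize(smiles):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(smiles, vocab, max_len):
    """Map a SMILES string to a fixed-length list of token indices."""
    ids = [vocab.get(token, UNK) for token in tokenize(smiles)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def embed(ids, table):
    """Look up one embedding vector per token index."""
    return [table[i] for i in ids]


corpus = ["CCl", "c1ccccc1O"]  # chloromethane-like fragment and phenol
vocab = build_vocab(corpus)

# Randomly initialized embedding table (illustrative stand-in for a learned one).
rng = random.Random(0)
dim = 4
table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]

ids = encode("CCl", vocab, max_len=5)
vectors = embed(ids, table)
```

The resulting sequence of vectors (one per token, padded to a fixed length) is the kind of input that a sequence model or, after pooling or flattening, a classical learner such as a support vector machine or random forest could consume.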
