Machine learning in chemoinformatics and drug discovery.

Chemoinformatics is an established discipline that focuses on extracting, processing and extrapolating meaningful data from chemical structures. With the rapid growth of chemical 'big' data from high-throughput screening (HTS) and combinatorial synthesis, machine learning has become an indispensable tool for drug designers, who mine chemical information from large compound databases to design drugs with important biological properties. We first review the multiple processing layers of the chemoinformatics pipeline and then introduce the machine learning models commonly used in drug discovery and quantitative structure-activity relationship (QSAR) analysis. We present basic principles and recent case studies to demonstrate the utility of machine learning techniques in chemoinformatics analyses, and we discuss limitations and future directions to guide further development in this evolving field.
