Interpretation of Compound Activity Predictions from Complex Machine Learning Models Using Local Approximations and Shapley Values.

In qualitative or quantitative studies of structure-activity relationships (SARs), machine learning (ML) models are trained to recognize structural patterns that differentiate active from inactive compounds. Understanding model decisions is challenging but critically important for guiding compound design. Moreover, interpreting ML results provides an additional level of model validation based on expert knowledge. Many complex ML approaches, especially deep learning (DL) architectures, have a distinctive black-box character. Herein, a locally interpretable explanatory method termed SHapley Additive exPlanations (SHAP) is introduced for rationalizing activity predictions of any ML algorithm, regardless of its complexity. Models resulting from random forest (RF), non-linear support vector machine (SVM), and deep neural network (DNN) calculations are interpreted, and structural patterns that increase or reduce the predicted probability of activity are identified and mapped onto test compounds. The results indicate that SHAP has high potential for rationalizing predictions of complex ML models.
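To make the idea of feature contributions that increase or reduce a predicted probability concrete, the following minimal sketch computes exact Shapley values for a single prediction by enumerating all feature coalitions. It is an illustration under stated assumptions, not the implementation used in this work: `toy_model`, its coefficients, the three-bit "fingerprint", and the all-zero baseline are hypothetical, and features left out of a coalition are simply set to their baseline value. Exact enumeration scales exponentially with the number of features, so it is only feasible for very small inputs; practical SHAP implementations rely on approximations instead.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for instance x relative to a baseline.

    Features outside a coalition take their baseline value
    (an assumption made for this sketch).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(bits):
    # Hypothetical stand-in for a black-box activity model:
    # bit 0 raises the score, bits 0 and 1 interact, bit 2 lowers it.
    return 0.1 + 0.3 * bits[0] + 0.4 * bits[0] * bits[1] - 0.2 * bits[2]


compound = [1, 1, 1]   # hypothetical fingerprint of a test compound
baseline = [0, 0, 0]   # hypothetical reference input (no features present)

phi = shapley_values(toy_model, compound, baseline)
for index, value in enumerate(phi):
    print(f"feature {index}: contribution {value:+.3f}")
```

In this toy case the contributions sum to the difference between the prediction for the compound and the prediction for the baseline, which is the additivity property that gives SHAP its name. Positive values mark features that raise the predicted score and negative values mark features that lower it; the interaction term is split evenly between the two features involved.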
