Evaluation of QSAR Equations for Virtual Screening

Quantitative Structure-Activity Relationship (QSAR) models describe the correlation between activities and structure-based molecular descriptors. This information is important for understanding the factors that govern molecular properties and for designing new compounds with favorable properties. Because of the large number of calculable descriptors and, consequently, the much larger number of descriptor combinations, the derivation of QSAR models can be treated as an optimization problem. For continuous responses, the metrics typically optimized in this process relate to model performance on the training set, for example, R² and Q²_CV. Similar metrics calculated on an external data set (e.g., Q²_F1/F2/F3) are used to evaluate the performance of the final models. A common feature of these metrics is that they are "context-ignorant". In this work we propose that QSAR models should be evaluated based on their intended usage. More specifically, we argue that QSAR models developed for Virtual Screening (VS) should be derived and evaluated using a virtual screening-aware metric, e.g., an enrichment-based metric. To demonstrate this point, we developed 21 Multiple Linear Regression (MLR) models for seven targets (three models per target), evaluated them first on validation sets, and subsequently tested their performance on two additional test sets constructed to mimic small-scale virtual screening campaigns. As expected, we found no correlation between model performance evaluated by "classical" metrics, e.g., R² and Q²_F1/F2/F3, and the number of active compounds picked by the models from within a pool of random compounds. In particular, in some cases models with favorable R² and/or Q²_F1/F2/F3 values were unable to pick a single active compound from within the pool, whereas in other cases models with poor R² and/or Q²_F1/F2/F3 values performed well in the context of virtual screening.
We also found no significant correlation between the numbers of active compounds correctly identified by the models in the training, validation and test sets. Next, we developed a new algorithm, the Enrichment Optimizer Algorithm (EOA), which derives MLR models by optimizing an enrichment-based metric, and tested its performance on the same datasets. We found that the best models derived in this manner showed, in most cases, much more consistent results across the training, validation and test sets and outperformed the corresponding MLR models in most virtual screening tests. Finally, we demonstrated that, when tested as binary classifiers, models derived for the same targets by the new algorithm outperformed Random Forest (RF) and Support Vector Machine (SVM)-based models across the training, validation and test sets in most cases. We attribute the better performance of the EOA models in VS to better handling of inactive random compounds. Optimizing an enrichment-based metric is therefore a promising strategy for the derivation of QSAR models for classification and virtual screening.
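
The contrast between "classical" and screening-aware metrics can be illustrated with a minimal Python sketch. It is not the metric or the algorithm used in this work, whose exact definitions are not given here. It assumes the standard definitions of R² and Q²_F1 and a generic top-k enrichment factor; the function names and the cutoff k are hypothetical and chosen for illustration only.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination on a single data set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def q2_f1(y_ext, y_ext_pred, y_train_mean):
    """External Q2_F1: residuals on the external set, referenced to the
    training-set mean activity."""
    y_ext = np.asarray(y_ext, dtype=float)
    y_ext_pred = np.asarray(y_ext_pred, dtype=float)
    ss_res = np.sum((y_ext - y_ext_pred) ** 2)
    ss_ref = np.sum((y_ext - y_train_mean) ** 2)
    return 1.0 - ss_res / ss_ref


def actives_in_top_k(scores, is_active, k):
    """Number of known actives among the k best-scored compounds
    (assumes higher score = predicted more active)."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    order = np.argsort(-scores, kind="stable")
    return int(is_active[order[:k]].sum())


def enrichment_factor(scores, is_active, k):
    """Hit rate in the top k divided by the hit rate in the whole pool."""
    is_active = np.asarray(is_active, dtype=bool)
    hits = actives_in_top_k(scores, is_active, k)
    return (hits / k) / (is_active.sum() / is_active.size)


if __name__ == "__main__":
    # Toy pool: 3 actives hidden among 10 compounds (illustrative data only).
    scores = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
    labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
    print(actives_in_top_k(scores, labels, k=2))
    print(enrichment_factor(scores, labels, k=2))
```

R² and Q²_F1 depend only on the residuals of the compounds with measured activities, whereas the enrichment factor depends only on how actives are ranked relative to the rest of the pool. A model can therefore score well on one and poorly on the other.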
