Estimation of the applicability domain of kernel-based machine learning models for virtual screening

Background: Virtual screening of large compound databases is an important application of structure-activity relationship models. Because these data sets are structurally highly diverse, machine learning based QSAR models, which rely on a specific training set, cannot give reliable results for all compounds. It is therefore important to consider the subset of chemical space in which the model is applicable. Most approaches to this problem published so far use vectorial descriptor representations to define this domain of applicability. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model.

Results: We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of training a kernel-based QSAR model using support vector regression and ranking a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model to the respective compound is described quantitatively by a score obtained from an applicability domain formulation. The suitability of each applicability domain estimation is evaluated by comparing the model performance on the subsets of the screening data set obtained with different thresholds on the applicability scores. This comparison indicates that it is possible to separate the part of chemical space in which the model gives reliable predictions from the part consisting of structures too dissimilar to the training set for the model to be applied successfully. A closer inspection reveals that the virtual screening performance of the model improves considerably if the half of the molecules with the lowest applicability scores is omitted from the screening.

Conclusion: The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some active compounds should not be considered a drawback, because the results indicate that, in most cases, the model would not have found these omitted ligands anyway.
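
The thresholding step described above can be illustrated with a minimal Python sketch. It is not one of the three formulations proposed in the paper, which the abstract does not specify. As a stand-in, it assumes the applicability score of a screening compound is the mean kernel similarity to its k most similar training compounds. The function names, the toy kernel values, and the predicted activities are all hypothetical.

```python
from statistics import mean


def applicability_scores(k_screen_train, k=2):
    """Score each screening compound by its mean kernel similarity to its
    k most similar training compounds.

    ASSUMPTION: this score is an illustrative stand-in, not one of the
    applicability domain formulations from the paper.
    """
    scores = []
    for row in k_screen_train:
        nearest = sorted(row, reverse=True)[:k]
        scores.append(mean(nearest))
    return scores


def filter_by_applicability(predictions, scores, keep_fraction=0.5):
    """Keep the compounds with the highest applicability scores and return
    their indices ranked by predicted activity (highest first)."""
    n_keep = max(1, int(len(scores) * keep_fraction))
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = by_score[:n_keep]
    return sorted(kept, key=lambda i: predictions[i], reverse=True)


# Hypothetical toy data: rows are screening compounds, columns are
# training compounds, entries are normalized kernel similarities.
k_screen_train = [
    [0.9, 0.8, 0.7, 0.1],
    [0.2, 0.1, 0.3, 0.2],
    [0.6, 0.7, 0.5, 0.4],
    [0.1, 0.1, 0.2, 0.1],
]
# Hypothetical predicted activities, e.g. from a support vector regression model.
predictions = [5.1, 7.9, 6.3, 8.4]

scores = applicability_scores(k_screen_train, k=2)
ranking = filter_by_applicability(predictions, scores, keep_fraction=0.5)
print(ranking)
```

In this toy example the two compounds with the highest predicted activity are also the least similar to the training set, so they are dropped before ranking. This mirrors the argument in the abstract that predictions for such compounds are unlikely to be reliable.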
