One- to Four-Dimensional Kernels for Virtual Screening and the Prediction of Physical, Chemical, and Biological Properties

Many chemoinformatics applications, including high-throughput virtual screening, benefit from being able to rapidly predict the physical, chemical, and biological properties of small molecules in order to screen large repositories and identify suitable candidates. When training sets are available, machine learning methods provide an effective alternative to ab initio methods for these predictions. Here, we leverage rich molecular representations, including 1D SMILES strings, 2D graphs of bonds, and 3D coordinates, to derive efficient machine learning kernels for regression problems. We further expand the library of available spectral kernels for small molecules, originally developed for classification problems, to include 2.5D surface and 3D kernels using Delaunay tetrahedrization and other techniques from computational geometry, 3D pharmacophore kernels, and 3.5D or 4D kernels capable of taking into account multiple molecular configurations, such as conformers. The kernels are comprehensively tested using cross-validation and redundancy-reduction methods on regression problems, using several available data sets to predict boiling points, melting points, aqueous solubility, octanol/water partition coefficients, and biological activity, with state-of-the-art results. When sufficient training data are available, 2D spectral kernels generally yield the best and most robust results, exceeding the state of the art. On data sets containing thousands of molecules, the kernels achieve a squared correlation coefficient of 0.91 for aqueous solubility prediction and 0.94 for octanol/water partition coefficient prediction. Averaging over conformations improves the performance of kernels based on the three-dimensional structure of molecules, especially on challenging data sets. Kernel predictors for aqueous solubility (kSOL), LogP (kLOGP), and melting point (kMELT) are available over the Web at http://cdb.ics.uci.edu.
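To make the idea of a 2D spectral kernel concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a molecule is given as a list of atom labels plus a list of bonds, that the "spectrum" is the count of labeled paths of up to `max_len` atoms found by depth-first search, and that two spectra are compared with a MinMax similarity. The function names `path_spectrum` and `minmax_kernel`, the label format, and the heavy-atom-only toy molecules are all illustrative assumptions.

```python
from collections import Counter


def path_spectrum(atoms, bonds, max_len=3):
    """Count labeled simple paths containing up to max_len atoms (hypothetical helper)."""
    # Build an undirected adjacency list from the bond list.
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    counts = Counter()

    def walk(node, visited, label):
        # Record the path that ends at this node, then try to extend it.
        counts[label] += 1
        if len(visited) == max_len:
            return
        for nxt in adj[node]:
            if nxt not in visited:
                walk(nxt, visited | {nxt}, label + "-" + atoms[nxt])

    # Start a depth-first search from every atom.
    for start in range(len(atoms)):
        walk(start, {start}, atoms[start])
    return counts


def minmax_kernel(s1, s2):
    """MinMax similarity between two path-count spectra (assumed kernel choice)."""
    keys = set(s1) | set(s2)
    num = sum(min(s1[k], s2[k]) for k in keys)  # Counter returns 0 for missing keys
    den = sum(max(s1[k], s2[k]) for k in keys)
    return num / den if den else 0.0


# Toy heavy-atom graphs: ethanol (C-C-O) and ethane (C-C).
ethanol = path_spectrum(["C", "C", "O"], [(0, 1), (1, 2)])
ethane = path_spectrum(["C", "C"], [(0, 1)])
k_self = minmax_kernel(ethanol, ethanol)
k_cross = minmax_kernel(ethanol, ethane)
```

A Gram matrix built from such pairwise values could then be passed to any kernel regression method. A kernel over multiple conformers could, under the same assumptions, be approximated by averaging a 3D kernel over all pairs of conformations of the two molecules.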


[21]  C. Hansch,et al.  QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS OF THE BENZODIAZEPINES. A REVIEW AND REEVALUATION , 1994 .

[22]  David Page,et al.  Multiple Instance Regression , 2001, ICML.

[23]  C. Dobson Chemical space and biology , 2004, Nature.

[24]  Bernd Beck,et al.  Prediction of the n-Octanol/Water Partition Coefficient, logP, Using a Combination of Semiempirical MO-Calculations and a Neural Network , 1997 .

[25]  C. Hansch,et al.  Quantitative Structure‐Activity Relationships of the Benzodiazepines. A Review and Reevaluation. , 1995 .

[26]  Thomas G. Dietterich,et al.  Solving the Multiple Instance Problem with Axis-Parallel Rectangles , 1997, Artif. Intell..

[27]  U. Hobohm,et al.  Selection of representative protein data sets , 1992, Protein science : a publication of the Protein Society.

[28]  M. Aizerman,et al.  Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[29]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[30]  Alessandro Sperduti,et al.  Special issue on neural networks and kernel methods for structured domains , 2005, Neural Networks.

[31]  G. Wahba,et al.  Some results on Tchebycheffian spline functions , 1971 .

[32]  Pierre Baldi,et al.  Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity , 2005, ISMB.

[33]  James T. Kwok,et al.  A regularization framework for multiple-instance learning , 2006, ICML.

[34]  Gerhard Klebe,et al.  Comparison of Automatic Three-Dimensional Model Builders Using 639 X-ray Structures , 1994, J. Chem. Inf. Comput. Sci..

[35]  Herbert Edelsbrunner,et al.  Three-dimensional alpha shapes , 1992, VVS.

[36]  Andreas Zell,et al.  Towards Optimal Descriptor Subset Selection with Support Vector Machines in Classification and Regression , 2004 .

[37]  A. Micheli,et al.  A Novel Approach to QSPR/QSAR Based on Neural Networks for Structures , 2003 .

[38]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[39]  Bernhard Schölkopf,et al.  Fast Kernels for String and Tree Matching , 2004 .

[40]  Luhua Lai,et al.  A New Atom-Additive Method for Calculating Partition Coefficients , 1997, J. Chem. Inf. Comput. Sci..

[41]  Ruisheng Zhang,et al.  Support Vector Machines-Based Quantitative Structure-Property Relationship for the Prediction of Heat Capacity , 2004, J. Chem. Inf. Model..

[42]  Muthukumarasamy Karthikeyan,et al.  General Melting Point Prediction Based on a Diverse Compound Data Set and Artificial Neural Networks , 2005, J. Chem. Inf. Model..

[43]  Tatsuya Akutsu,et al.  Graph Kernels for Molecular Structure-Activity Relationship Analysis with Support Vector Machines , 2005, J. Chem. Inf. Model..