Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules

We investigate the use of different machine learning methods to construct models for aqueous solubility. The models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules from Bayer Schering Pharma. For each method, we also consider an appropriate way to obtain error bars in order to estimate the domain of applicability (DOA) of each model. Specifically, we investigate error bars from a Bayesian model (Gaussian Process, GP), an ensemble-based approach (Random Forest), and approaches based on the Mahalanobis distance to the training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation and on an external validation set of 536 molecules) and of how faithfully the individual error bars represent the actual prediction error.
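As an illustration of the distance-based idea, the following minimal sketch computes the Mahalanobis distance of query points to the mean of a training set. It is not the procedure used in the study: the function name `mahalanobis_to_training`, the `ridge` regularization term, the choice of the training-set mean as the reference point, and the synthetic descriptor data are all assumptions made for the example.

```python
import numpy as np

def mahalanobis_to_training(X_train, X_query, ridge=1e-6):
    """Mahalanobis distance of each query row to the training-set mean.

    Hypothetical helper for illustration; `ridge` is an assumed small
    diagonal term that keeps the covariance matrix invertible.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
    mean = X_train.mean(axis=0)
    # Sample covariance of the training descriptors (columns are features).
    cov = np.atleast_2d(np.cov(X_train, rowvar=False))
    cov = cov + ridge * np.eye(X_train.shape[1])
    diff = X_query - mean
    # Solve cov @ z = diff^T instead of forming the explicit inverse.
    solved = np.linalg.solve(cov, diff.T).T
    # Row-wise quadratic form diff^T cov^{-1} diff, then square root.
    return np.sqrt(np.einsum("ij,ij->i", diff, solved))

# Synthetic stand-in for a descriptor matrix (not data from the study).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
near = np.zeros(3)        # close to the bulk of the training data
far = np.full(3, 10.0)    # far outside the training distribution
d_near, d_far = mahalanobis_to_training(X_train, np.vstack([near, far]))
```

In this setting a large distance (here `d_far`) would flag a compound as lying outside the region covered by the training data, where a prediction is likely to be less reliable.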
