Machine learning models for lipophilicity and their domain of applicability.

Unfavorable lipophilicity and water solubility cause many drug failures, so these properties must be taken into account early in lead discovery. Commercial tools for predicting lipophilicity have usually been trained on small, neutral molecules and are therefore often unable to predict in-house data accurately. Using a modern Bayesian machine learning algorithm, a Gaussian process model, this study constructs a log D7 model based on 14,556 drug discovery compounds of Bayer Schering Pharma. Performance is compared with support vector machines, decision trees, ridge regression, and four commercial tools. In a blind test on 7,013 new measurements from recent months (including compounds from new projects), 81% were predicted correctly within 1 log unit, compared to only 44% achieved by commercial software. Additional evaluations using public data are presented. We consider error bars for each method (model-based, ensemble-based, and distance-based approaches) and investigate how well they quantify the domain of applicability of each model.
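
To make the two central ideas of the abstract concrete (model-based error bars from a Gaussian process, and the "predicted within 1 log unit" criterion), the following is a minimal sketch in plain Python. It is not the authors' implementation: the squared-exponential kernel, its fixed hyperparameters, the zero-mean prior, the one-dimensional "descriptor", and all numbers are assumptions chosen only for illustration.

```python
import math

def rbf(a, b, length=1.0, signal=1.0):
    """Squared-exponential kernel between two descriptor vectors (assumed kernel)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return signal * math.exp(-0.5 * sq / length ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(X, y, noise=0.1):
    """Fit a zero-mean GP; `noise` is the assumed measurement-noise variance."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y))
    return X, L, alpha

def gp_predict(model, x_new):
    """Return the predictive mean and the standard deviation of the latent function."""
    X, L, alpha = model
    k = [rbf(x_new, a) for a in X]
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve_lower(L, k)
    var = rbf(x_new, x_new) - sum(vi * vi for vi in v)
    return mean, math.sqrt(max(var, 0.0))

def fraction_within(predicted, measured, tol=1.0):
    """Fraction of predictions within `tol` log units of the measurement."""
    hits = sum(1 for p, m in zip(predicted, measured) if abs(p - m) <= tol)
    return hits / len(predicted)

# Toy training set: a single made-up descriptor and made-up log D values.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.5, 1.2, 2.1, 2.9]
model = gp_fit(X_train, y_train)

mean_near, sd_near = gp_predict(model, [1.5])   # inside the training range
mean_far, sd_far = gp_predict(model, [10.0])    # far from all training data

# Blind-test style summary on made-up predictions and measurements.
frac = fraction_within([1.0, 3.5, 0.0], [1.5, 2.0, 0.9])
```

In this sketch the error bar for the query far from the training data (`sd_far`) is larger than the one for the query between training points (`sd_near`), which is the behavior that lets a model-based error bar flag compounds outside the domain of applicability. A real model would use many molecular descriptors and fit the kernel hyperparameters to the data rather than fixing them.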
