QSAR with experimental and predictive distributions: an information theoretic approach for assessing model quality

We propose that quantitative structure–activity relationship (QSAR) predictions should be explicitly represented as predictive probability distributions. If both predictions and experimental measurements are treated as probability distributions, the quality of a set of predictive distributions output by a model can be assessed with the Kullback–Leibler (KL) divergence, a widely used information-theoretic measure of how much one probability distribution differs from another. We assessed a range of machine learning algorithms and error estimation methods for producing predictive distributions by evaluating them on three of AstraZeneca’s global DMPK datasets. Using the KL-divergence framework, we identified a few combinations of algorithms that produce accurate and valid compound-specific predictive distributions. These methods use reliability indices to assign predictive distributions to the predictions output by QSAR models, so that reliable predictions have tight distributions and unreliable predictions have wide ones. Finally, we show how valid predictive distributions can be used to estimate the probability that a test compound has properties that hit single- or multi-objective target profiles.
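To make the two ideas in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that both the experimental and the predictive distribution for a compound are univariate Gaussians, which the abstract does not state. It also assumes that the properties in a multi-objective profile are independent. The function names (`kl_gaussian`, `prob_in_range`, `prob_profile`) and the direction in which the divergence is taken are illustrative choices.

```python
import math


def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) in nats for P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2).

    Assumption: P is the experimental distribution and Q the predictive one.
    KL divergence is asymmetric, so swapping the arguments changes the value.
    """
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_in_range(mu, sigma, lower=-math.inf, upper=math.inf):
    """Probability that a Gaussian predictive distribution falls in [lower, upper]."""
    return normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)


def prob_profile(predictions, targets):
    """Probability of meeting every target range in a multi-objective profile.

    predictions: list of (mu, sigma) pairs, one per property.
    targets: list of (lower, upper) pairs, one per property.
    Assumption: properties are treated as independent, so the
    per-property probabilities are multiplied.
    """
    p = 1.0
    for (mu, sigma), (lower, upper) in zip(predictions, targets):
        p *= prob_in_range(mu, sigma, lower, upper)
    return p
```

Under these assumptions, a predictive distribution that matches the experimental one exactly gives a divergence of zero, and both a misplaced mean and a badly chosen width increase it. This is why a single KL score can reward predictions that are both accurate and honest about their uncertainty.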
