A confidence predictor for logD using conformal regression and a support-vector machine

Lipophilicity is a major determinant of ADMET properties and overall suitability of drug candidates. We have developed large-scale models to predict the water–octanol distribution coefficient (logD) for chemical compounds, aiding drug discovery projects. Using ACD/logD data for 1.6 million compounds from the ChEMBL database, models are created and evaluated by a support-vector machine with a linear kernel using conformal prediction methodology, outputting prediction intervals at a specified confidence level. The resulting model shows a predictive ability of $Q^{2}=0.973$, and the best-performing nonconformity measure gives a median prediction interval of $\pm 0.39$ log units at 80% confidence and $\pm 0.60$ log units at 90% confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor, and we also publish predicted values at 90% confidence level for 91 M PubChem structures in RDF format, both for download and through a URI resolver service.
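To illustrate how a conformal regressor turns point predictions into prediction intervals at a chosen confidence level, the following is a minimal sketch, not the method used in this work. It assumes a split (inductive) calibration set, a plain absolute-residual nonconformity measure, synthetic one-dimensional data, and an ordinary least-squares line as a stand-in for the linear-kernel support-vector machine. All function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares fit y = a*x + b (assumed stand-in for the linear SVM)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def conformal_half_width(residuals, confidence):
    """Interval half-width from calibration residuals.

    Nonconformity score = |y - y_hat| (an assumption; normalized measures
    would scale this per compound). Returns the ceil((n + 1) * confidence)-th
    smallest score, or infinity if the calibration set is too small.
    """
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    k = math.ceil((n + 1) * confidence)
    if k > n:
        return math.inf
    return scores[k - 1]


def predict_interval(model, half_width, x):
    """Symmetric prediction interval around the point prediction."""
    a, b = model
    y_hat = a * x + b
    return y_hat - half_width, y_hat + half_width


# Synthetic data: a noisy linear relationship (purely illustrative).
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(600)]
ys = [1.5 * x + 0.2 + rng.gauss(0, 0.5) for x in xs]

# Disjoint training and calibration sets.
train_x, train_y = xs[:300], ys[:300]
calib_x, calib_y = xs[300:500], ys[300:500]

model = fit_line(train_x, train_y)
a, b = model
residuals = [y - (a * x + b) for x, y in zip(calib_x, calib_y)]

w80 = conformal_half_width(residuals, 0.80)
w90 = conformal_half_width(residuals, 0.90)
low, high = predict_interval(model, w90, 1.0)
print(f"80% half-width: {w80:.2f}, 90% half-width: {w90:.2f}")
print(f"90% interval at x = 1.0: [{low:.2f}, {high:.2f}]")
```

Raising the confidence level selects a larger calibration residual, so the interval can only widen. This matches the pattern reported above, where the median interval grows from 80% to 90% confidence.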
