Open-source QSAR models for pKa prediction using multiple machine learning approaches

Background: The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction.

Methods: The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure–activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN).

Results: The three methods delivered comparable performances on the training and test sets, with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and performance of our models compared favorably to the commercial products.

Conclusions: This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.
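The two performance measures quoted in the results, RMSE and the coefficient of determination, can be illustrated with a minimal sketch. It assumes the standard textbook definitions (RMSE as the square root of the mean squared residual, and R² as one minus the ratio of the residual sum of squares to the total sum of squares); the abstract does not state which exact formulation the authors used. The function names and the pKa values below are hypothetical and serve only as an illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-squared error between observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical pKa values, made up for illustration; not data from the study.
    observed_pka = [4.8, 9.2, 3.1, 10.5, 7.4]
    predicted_pka = [5.3, 8.1, 4.0, 9.9, 8.6]
    print("RMSE:", round(rmse(observed_pka, predicted_pka), 2))
    print("R2:", round(r_squared(observed_pka, predicted_pka), 2))
```

Because RMSE is expressed in the units of the predicted quantity, a value around 1.5 corresponds to a typical error of about 1.5 pKa units, whereas R² is unitless and depends on the spread of the observed values.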
