Can human experts predict solubility better than computers?

In this study, we design and carry out a survey asking human experts to predict the aqueous solubility of druglike organic compounds. We investigate whether these experts, drawn largely from the pharmaceutical industry and academia, can match or exceed the predictive power of algorithms. Alongside this, we implement 10 typical machine learning algorithms on the same dataset. The best algorithm, a variety of neural network known as a multi-layer perceptron, gave an RMSE of 0.985 log S units and an R² of 0.706. We would not have predicted the relative success of this particular algorithm in advance. The best individual human predictor achieved almost identical prediction quality, with an RMSE of 0.942 log S units and an R² of 0.723. The collection of algorithms contained a higher proportion of reasonably good predictors: nine out of ten, compared with around half of the humans. We found that, for either humans or algorithms, combining individual predictions into a consensus predictor by taking their median generated excellent predictivity. While our consensus human predictor achieved very slightly better headline figures on various statistical measures, the difference between it and the consensus machine learning predictor was small and not statistically significant. We conclude that human experts can predict the aqueous solubility of druglike molecules essentially as well as machine learning algorithms can. We find that, for either humans or algorithms, combining individual predictions into a consensus predictor by taking their median is a powerful way of benefiting from the wisdom of crowds.
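The median consensus and the two headline metrics can be illustrated with a minimal sketch. This is not the study's code: the function names and the toy log S values below are hypothetical, and R² is assumed here to be the coefficient of determination (1 minus the ratio of residual to total sum of squares), which may differ from the definition used in the paper.

```python
import math
from statistics import median


def consensus_median(predictions):
    """Combine several predictors by taking the per-compound median.

    `predictions` is a list of equal-length lists, one list per predictor.
    """
    return [median(per_compound) for per_compound in zip(*predictions)]


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs (log S)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination (an assumed definition of R²)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical log S values for four compounds (not data from the study).
measured = [-2.0, -3.5, -1.0, -4.0]
predictor_a = [-2.5, -3.0, -1.5, -4.5]
predictor_b = [-1.0, -3.5, -0.5, -3.0]
predictor_c = [-2.2, -5.0, -1.0, -4.0]

consensus = consensus_median([predictor_a, predictor_b, predictor_c])
consensus_rmse = rmse(measured, consensus)
consensus_r2 = r_squared(measured, consensus)
```

In this toy example each individual predictor makes a few large errors, but the errors fall on different compounds, so the per-compound median discards them. The same function applies unchanged whether the inputs are human or algorithmic predictions.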
