Development of Dimethyl Sulfoxide Solubility Models Using 163 000 Molecules: Using a Domain Applicability Metric to Select More Reliable Predictions

Dimethyl sulfoxide (DMSO) solubility data from Enamine and two UCB Pharma compound collections were analyzed using 8 machine learning methods and 12 descriptor sets. The data sets were highly imbalanced, with only 1.7–5.8% nonsoluble compounds. Methods were compared by the enrichment in soluble molecules obtained when only the 10% most reliable predictions were selected. The highest accuracies were obtained with a C4.5 decision tree classifier, random forest, and associative neural networks. Model performance was estimated on the individual data sets and on their combinations. On average, the models gave a 2-fold decrease in the number of nonsoluble compounds among all compounds predicted as soluble in DMSO. A 4–9-fold enrichment was observed, however, when only the 10% most reliable predictions were considered. Structural features associated with solubility or insolubility in DMSO were also identified. The best models developed with the publicly available Enamine data set are freely available online at http://ochem.eu/article/33409.
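
The fold-decrease figures quoted above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the definition used here is an assumption: the nonsoluble rate of the whole library divided by the nonsoluble rate in a selected subset, where the subset is either all compounds predicted soluble or the 10% of them with the highest reliability score. The function names and the `reliability` input are hypothetical and are not taken from the paper or from OCHEM.

```python
from typing import Sequence, Tuple


def nonsoluble_rate(is_nonsoluble: Sequence[bool]) -> float:
    """Fraction of compounds in a selection that are experimentally nonsoluble."""
    if not is_nonsoluble:
        raise ValueError("empty selection")
    return sum(is_nonsoluble) / len(is_nonsoluble)


def fold_decrease(baseline_rate: float, selected_rate: float) -> float:
    """Assumed metric: how many times rarer nonsoluble compounds are in the selection."""
    if selected_rate == 0:
        return float("inf")  # no nonsoluble compounds left in the selection
    return baseline_rate / selected_rate


def enrichment_summary(
    is_nonsoluble: Sequence[bool],
    predicted_soluble: Sequence[bool],
    reliability: Sequence[float],
    top_fraction: float = 0.10,
) -> Tuple[float, float]:
    """Return (fold decrease over all predicted-soluble compounds,
    fold decrease over the most reliable top_fraction of them)."""
    if not (len(is_nonsoluble) == len(predicted_soluble) == len(reliability)):
        raise ValueError("inputs must have the same length")

    baseline = nonsoluble_rate(is_nonsoluble)

    # Indices of compounds the model calls soluble, most reliable first.
    picked = [i for i, flag in enumerate(predicted_soluble) if flag]
    picked.sort(key=lambda i: reliability[i], reverse=True)

    # Assumption: the 10% is taken from the predicted-soluble compounds.
    n_top = max(1, round(top_fraction * len(picked)))

    all_rate = nonsoluble_rate([is_nonsoluble[i] for i in picked])
    top_rate = nonsoluble_rate([is_nonsoluble[i] for i in picked[:n_top]])
    return fold_decrease(baseline, all_rate), fold_decrease(baseline, top_rate)
```

With a hypothetical reliability score (for example, an applicability-domain measure where higher means more trustworthy), `enrichment_summary` returns the two ratios side by side, which makes the gap between the average result and the result for the most reliable predictions explicit.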
