From Descriptors to Predicted Properties: Experimental Design by Using Applicability Domain Estimation

Reliable methods for representative sub-sampling are crucial for experimental design and risk assessment within the European Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) system. We developed experimental design approaches that utilise predicted properties and the ‘distance to model’ parameter to estimate how much individual compounds would benefit the quality of a resulting model. A statistical evaluation of four regression data sets and one classification data set showed that the adaptive concept of iteratively refining the representation of the chemical space contributes to a more efficient and more reliable selection than traditional approaches. Evaluating compounds with regard to the uncertainty and the correlation of predictions is beneficial, in particular for regression data sets of sufficient size, whereas using predicted properties to define the chemical space is beneficial for classification models.
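The adaptive idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the ‘distance to model’ is computed, so the sketch assumes it is the standard deviation of predictions from a bagged ensemble of linear models. The names `ensemble_std`, `adaptive_selection` and `measure` are hypothetical, and the correlation of predictions mentioned above is not modelled here.

```python
import numpy as np

def ensemble_std(X_train, y_train, X_pool, n_models=20, rng=None):
    """Assumed 'distance to model': std of bagged linear-model predictions."""
    rng = np.random.default_rng(rng)
    n = len(X_train)
    A_pool = np.c_[np.ones(len(X_pool)), X_pool]  # add intercept column
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        A = np.c_[np.ones(n), X_train[idx]]
        coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
        preds.append(A_pool @ coef)
    return np.std(preds, axis=0)

def adaptive_selection(X, measure, initial, n_select, seed=0):
    """Iteratively add the pool compound with the largest ensemble std.

    X        : descriptor matrix, one row per compound
    measure  : hypothetical callable returning the measured value for a row index
    initial  : indices of the compounds measured before the first iteration
    n_select : total number of compounds to select
    """
    rng = np.random.default_rng(seed)
    selected = list(initial)
    y = {i: measure(i) for i in selected}
    while len(selected) < n_select:
        pool = np.array([i for i in range(len(X)) if i not in y])
        y_train = np.array([y[i] for i in selected])
        dm = ensemble_std(X[selected], y_train, X[pool], rng=rng)
        pick = int(pool[np.argmax(dm)])           # least certain compound
        y[pick] = measure(pick)                   # "run the experiment"
        selected.append(pick)
    return selected
```

Each iteration refits the ensemble on all compounds measured so far, so the picture of the chemical space is refined after every new measurement. A traditional design would instead fix the whole selection in advance from the descriptors alone.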
