Using Random Forest To Model the Domain Applicability of Another Random Forest Model

In QSAR, a statistical model is built from a training set of molecules, each represented by chemical descriptors, and their biological activities. We call this traditional type of QSAR model an "activity model". The activity model can be used to predict the activities of molecules not in the training set. A relatively new subfield of QSAR is domain applicability, whose aim is to estimate the reliability of the prediction that a specific activity model makes for a specific molecule. A number of different metrics have been proposed in the literature for this purpose. It is desirable to build a quantitative model of reliability against one or more of these metrics; we call this an "error model". A previous publication from our laboratory (Sheridan, J. Chem. Inf. Model. 2012, 52, 814-823) suggested that the simultaneous use of three metrics would be more discriminating than any one metric. With three metrics, an error model can be built in the form of a three-dimensional set of bins. When the number of metrics exceeds three, however, the bin paradigm is not practical. An obvious solution for constructing an error model from multiple metrics is to use a QSAR method, in our case random forest. In this paper we demonstrate the usefulness of this paradigm, specifically for determining whether a useful error model can be built and which metrics are most useful for a given problem. For the ten data sets and seven metrics we examine here, it appears possible to construct a useful error model using only two metrics (TREE_SD and PREDICTED). These two metrics do not require calculating similarities/distances between the molecules being predicted and the molecules used to build the activity model, a calculation that can be rate-limiting.
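To make the bin paradigm concrete, the following is a minimal sketch of a binned error model over the two metrics named above. It is not the authors' implementation. It assumes that PREDICTED is the mean of the per-tree predictions of the random forest activity model and that TREE_SD is their standard deviation; the abstract does not define either metric. The function names, bin edges, and toy data are all hypothetical. In practice the errors would come from held-out or cross-validated predictions rather than from the training molecules.

```python
import math
import statistics
from collections import defaultdict


def metrics_from_trees(tree_predictions):
    """Return (PREDICTED, TREE_SD) for one molecule.

    Assumption: PREDICTED is the mean of the per-tree predictions and
    TREE_SD is their (population) standard deviation.
    """
    predicted = statistics.fmean(tree_predictions)
    tree_sd = statistics.pstdev(tree_predictions)
    return predicted, tree_sd


def bin_index(value, edges):
    """Index of the bin containing value, given ascending interior edges."""
    return sum(value >= edge for edge in edges)


def build_binned_error_model(rows, pred_edges, sd_edges):
    """Build {(pred_bin, sd_bin): RMSE of the activity-model error}.

    rows is an iterable of (tree_predictions, observed_activity) pairs.
    """
    squared_errors = defaultdict(list)
    for tree_predictions, observed in rows:
        predicted, tree_sd = metrics_from_trees(tree_predictions)
        key = (bin_index(predicted, pred_edges), bin_index(tree_sd, sd_edges))
        squared_errors[key].append((predicted - observed) ** 2)
    return {key: math.sqrt(statistics.fmean(errs))
            for key, errs in squared_errors.items()}


def expected_error(model, tree_predictions, pred_edges, sd_edges):
    """Look up the expected error for a new molecule (None if its bin is empty)."""
    predicted, tree_sd = metrics_from_trees(tree_predictions)
    key = (bin_index(predicted, pred_edges), bin_index(tree_sd, sd_edges))
    return model.get(key)


# Hypothetical toy data: (per-tree predictions, observed activity).
rows = [
    ([5.0, 5.2, 4.8], 5.1),  # trees agree, small error
    ([5.1, 4.9, 5.0], 4.8),  # trees agree, small error
    ([6.0, 8.0, 7.0], 5.5),  # trees disagree, large error
    ([7.5, 6.5, 8.5], 9.0),  # trees disagree, large error
]
pred_edges = [6.0]  # two bins along PREDICTED
sd_edges = [0.5]    # two bins along TREE_SD

model = build_binned_error_model(rows, pred_edges, sd_edges)
new_error = expected_error(model, [6.5, 7.5, 8.5], pred_edges, sd_edges)
empty_bin = expected_error(model, [3.0, 5.0, 7.0], pred_edges, sd_edges)
```

With b bins per metric and k metrics the table has b^k cells, so most cells are empty once k grows beyond three; the last lookup above already lands in an empty cell. Fitting a regression model such as random forest to the same (metrics, error) pairs avoids the fixed grid.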
