Predicting the predictability: a unified approach to the applicability domain problem of QSAR models.

The present work proposes a unified conceptual framework to describe and quantify the Applicability Domains (AD) of Quantitative Structure-Activity Relationships (QSARs). AD models are conceived as meta-models μ_μ designed to associate an untrustworthiness score with any molecule M subject to property prediction by a QSAR model μ. Untrustworthiness scores, or "AD metrics", Ψ_μ(M) express the relationship between M (represented by its descriptors in chemical space) and the zones of that space populated by the training molecules on which model μ is based. Scores integrating some of the classical AD criteria (similarity-based, box-based) were considered in addition to newly introduced terms: the consensus prediction variance, the dissimilarity to outlier-free training sets, and the correlation breakdown count (the first two being the most successful). A loose correlation is expected between this untrustworthiness and the error |P_μ(M) - P_expt(M)| affecting the property P_μ(M) predicted by μ. While high untrustworthiness does not preclude correct predictions, inaccurate predictions at low untrustworthiness must be avoided. This kind of relationship is characteristic of the Neighborhood Behavior (NB) problem: dissimilar molecule pairs may or may not display similar properties, but similar molecule pairs with different properties are explicitly "forbidden". Therefore, statistical tools developed to tackle the NB problem were applied and led to a unified AD metric benchmarking scheme. A first use of untrustworthiness scores is the prioritization of predictions, without the need to specify a hard AD border. Moreover, if a sufficiently large set of external compounds is available, the formalism allows optimal AD borderlines to be fitted. Finally, consensus AD definitions were built by means of a nonparametric mixing scheme of two AD metrics of comparable quality and shown to outperform their respective parents.
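To make the notions of similarity-based and box-based AD metrics, prediction prioritization, and nonparametric mixing more concrete, the following minimal Python sketch may help. It is an illustration only, not the method of the paper: the function names, the use of Euclidean distance to the k nearest training molecules, the min/max descriptor box, and the rank-averaging consensus are all assumptions chosen for simplicity.

```python
import numpy as np

def knn_untrustworthiness(train_X, query_X, k=3):
    """Similarity-based score (assumed form): mean Euclidean distance
    from each query molecule to its k nearest training molecules."""
    diffs = query_X[:, None, :] - train_X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

def box_untrustworthiness(train_X, query_X):
    """Box-based score (assumed form): number of descriptors lying
    outside the min/max range spanned by the training set."""
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    return ((query_X < lo) | (query_X > hi)).sum(axis=1).astype(float)

def rank_consensus(score_a, score_b):
    """Hypothetical nonparametric mixing: average the ranks of two
    scores, so that their different scales do not matter."""
    rank_a = np.argsort(np.argsort(score_a, kind="stable"), kind="stable")
    rank_b = np.argsort(np.argsort(score_b, kind="stable"), kind="stable")
    return (rank_a + rank_b) / 2.0

def prioritize(scores):
    """Indices of query molecules, most trustworthy prediction first.
    No hard AD border is needed for this ordering."""
    return np.argsort(scores, kind="stable")

if __name__ == "__main__":
    # Toy descriptor vectors; real applications would use molecular descriptors.
    train_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    query_X = np.array([[0.5, 0.5], [3.0, 3.0], [1.2, 0.9]])
    psi_knn = knn_untrustworthiness(train_X, query_X, k=2)
    psi_box = box_untrustworthiness(train_X, query_X)
    print(prioritize(rank_consensus(psi_knn, psi_box)))
```

In this toy setting the query far from all training points receives the highest untrustworthiness under both metrics and is therefore ranked last. Fitting an actual AD borderline would additionally require external compounds with known prediction errors, which this sketch does not cover.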
