Principles of QSAR models validation: internal and external

The recent REACH policy of the European Union has led scientists and regulators to focus their attention on establishing general validation principles for QSAR models in the context of chemical regulation (previously known as the Setubal principles, now the OECD principles). This paper gives a brief analysis of three of these principles: an unambiguous algorithm, a defined Applicability Domain (AD), and statistical validation. Some concerns related to QSAR algorithm reproducibility are discussed, and an example of a fast check of the applicability domain for MLR models is presented. Common myths and misconceptions about popular techniques for verifying internal predictivity, particularly for MLR models (for instance cross-validation and bootstrap), are commented on and compared with commonly used statistical techniques for external validation. The differences between the two validation approaches are highlighted, and evidence is presented that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both external prediction and regulatory purposes. ("Validation is one of those words...that is constantly used and seldom defined", as stated by A. R. Feinstein in the book Multivariate Analysis: An Introduction, Yale University Press, New Haven, 1996.)
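
To make the distinction between internal and external validation concrete, the following is a minimal sketch on synthetic data, not the procedure of the paper itself. It assumes an ordinary least-squares MLR model, leave-one-out cross-validation as the internal check, a held-out test set as the external check, and a leverage-based applicability domain with the commonly used warning threshold h* = 3(p + 1)/n. The abstract does not specify any of these choices, and all function names and data are hypothetical.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X])

def fit_mlr(X, y):
    """Fit an MLR model by ordinary least squares; returns the coefficients."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef

def predict(coef, X):
    return design(X) @ coef

def q2_loo(X, y):
    """Internal validation: leave-one-out cross-validated Q2."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_mlr(X[keep], y[keep])
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def q2_ext(X_train, y_train, X_test, y_test):
    """External validation: Q2 on compounds never used to build the model."""
    coef = fit_mlr(X_train, y_train)
    resid = y_test - predict(coef, X_test)
    return 1.0 - np.sum(resid ** 2) / np.sum((y_test - y_train.mean()) ** 2)

def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (X^T X)^-1 x_i of each query compound."""
    A, Q = design(X_train), design(X_query)
    G = np.linalg.inv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, G, Q)

# Synthetic data (assumption): two descriptors, linear response, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)
X_train, y_train = X[:40], y[:40]
X_test, y_test = X[40:], y[40:]

internal_q2 = q2_loo(X_train, y_train)
external_q2 = q2_ext(X_train, y_train, X_test, y_test)

# Applicability-domain check: flag query compounds whose leverage exceeds h*.
n, p = X_train.shape
h_star = 3 * (p + 1) / n
queries = np.vstack([X_train.mean(axis=0), [10.0, 10.0]])
outside_ad = leverages(X_train, queries) > h_star
```

Here the query at the training centroid falls inside the domain, while the query far from the training data is flagged as an extrapolation. On real data the internal and external Q2 values can differ markedly, which is the reason for reporting both.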
