The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models

This paper emphasizes the importance of rigorous validation as a crucial, integral component of Quantitative Structure-Property Relationship (QSPR) model development. We consider examples of published QSPR models that, despite high fitted accuracy on their training sets and apparent mechanistic appeal, fail rigorous validation tests and thus may lack practical utility as reliable screening tools. We present a set of simple guidelines for developing validated and predictive QSPR models. To this end, we discuss several validation strategies, including (1) randomization of the modelled property, also called Y-scrambling, (2) multiple leave-many-out cross-validations, and (3) external validation using rational division of a dataset into training and test sets. We also highlight the need to establish the domain of model applicability in chemical space, so that molecules for which predictions may be unreliable can be flagged, and discuss some algorithms that can be used for this purpose. We advocate the broad use of these guidelines in the development of predictive QSPR models.
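
As a rough illustration of strategies (1) and (2), the sketch below computes a leave-many-out cross-validated q2 for a linear model and compares it with the q2 values obtained after Y-scrambling. It is not the procedure used in the paper: the synthetic descriptors, the ordinary least-squares model, the five-fold split, and all function names are assumptions made for the example, and q2 is taken here as 1 - PRESS / SS_total with the overall mean of the property as reference.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X (illustrative model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Predictions of the fitted linear model for descriptor matrix X."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def q2_leave_many_out(X, y, n_folds=5, seed=0):
    """Cross-validated q2 = 1 - PRESS / SS_total, leaving out one group at a time."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    press = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)          # all compounds not in this fold
        coef = fit_ols(X[train], y[train])
        press += np.sum((y[fold] - predict(coef, X[fold])) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def y_scrambling(X, y, n_rounds=50, seed=0):
    """q2 values after randomly permuting the property (Y-scrambling)."""
    rng = np.random.default_rng(seed)
    return np.array([
        q2_leave_many_out(X, rng.permutation(y), seed=seed)
        for _ in range(n_rounds)
    ])

# Synthetic data (assumption): 60 "compounds", 3 descriptors, a linear property.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

q2_real = q2_leave_many_out(X, y)
q2_scrambled = y_scrambling(X, y)
print(f"q2 (real property):      {q2_real:.3f}")
print(f"q2 (scrambled, maximum): {q2_scrambled.max():.3f}")
```

If the scrambled q2 values approach the q2 of the real model, the apparent fit is likely due to chance correlation. Neither check replaces external validation on a separate test set, which is strategy (3).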
