Statistical variation in progressive scrambling

Abstract

The two methods most often used to evaluate the robustness and predictivity of partial least squares (PLS) models are cross-validation and response randomization. However, both can be overly optimistic for data sets that contain redundant observations. The kinds of perturbation analysis widely used for evaluating model stability in the context of ordinary least squares regression are applicable only when the descriptors are independent of each other and the errors are independent and normally distributed; neither assumption holds for QSAR in general or for PLS in particular. Progressive scrambling is a novel, non-parametric approach to perturbing models in the response space in a way that does not disturb the underlying covariance structure of the data. Here, we introduce adjustments for two of the characteristic values produced by a progressive scrambling analysis -- the deprecated predictivity ($Q_{\rm s}^{*2}$) and the standard error of prediction ($\mathrm{SDEP}_{\rm s}^{*}$) -- that correct for the effect of the introduced perturbation. We also explore the statistical behavior of the adjusted values ($Q_{0}^{*2}$ and $\mathrm{SDEP}_{0}^{*}$) and of the sensitivity to perturbation ($dq^2/dr_{yy'}^2$). All three statistics are shown to be robust for stable PLS models, both in the stochastic component of their determination and in their variation due to the sampling effects involved in training set selection.
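The abstract does not spell out how the responses are perturbed, so the following is only a minimal sketch of one way a graded perturbation in the response space could be produced. It assumes (this is not stated in the abstract) that responses are shuffled within rank-ordered blocks, so that fewer blocks give stronger scrambling. The helper names `block_scramble` and `r2_yy` are hypothetical, and no PLS model is fitted here; the sketch only shows how the squared correlation between original and scrambled responses, written $r_{yy'}^2$ above, can be varied while the descriptor matrix is left untouched.

```python
import numpy as np

def block_scramble(y, n_blocks, rng):
    """Shuffle responses within rank-ordered blocks (hypothetical helper).

    Assumption: one block means full randomization, and as many blocks
    as observations means no perturbation at all.
    """
    y = np.asarray(y, dtype=float)
    order = np.argsort(y)  # observation indices, smallest to largest response
    y_new = y.copy()
    for block in np.array_split(order, n_blocks):
        # Permute response values among the observations in this block only.
        y_new[block] = y[rng.permutation(block)]
    return y_new

def r2_yy(y, y_scrambled):
    """Squared Pearson correlation between original and scrambled responses."""
    return float(np.corrcoef(y, y_scrambled)[0, 1] ** 2)

rng = np.random.default_rng(0)
y = rng.normal(size=60)  # synthetic responses, for illustration only
for n_blocks in (1, 2, 5, 15, 60):
    y_s = block_scramble(y, n_blocks, rng)
    print(n_blocks, round(r2_yy(y, y_s), 3))
```

In a full analysis, a model statistic such as $q^2$ would be recomputed for each scrambled response vector and examined as a function of $r_{yy'}^2$; that step depends on the PLS implementation and is omitted here.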
