Robust Cross-Validation of Linear Regression QSAR Models

A quantitative structure-activity relationship (QSAR) model is typically developed to predict the biochemical activity of untested compounds from their molecular structures. The "gold standard" of model validation is blind prediction, in which the model's predictive power is assessed by how well it predicts the activity values of compounds that were not considered in any way during model development or calibration. During the development of a QSAR model, however, some indication of the model's predictive power is still needed, and this is often obtained by some form of cross-validation (CV). In this study, the concepts of predictive power and fitting ability of a multiple linear regression (MLR) QSAR model were examined in the CV context, allowing for the presence of outliers. Commonly used predictive power and fitting ability statistics were assessed via Monte Carlo cross-validation on percent human intestinal absorption, blood-brain partition coefficient, and saxitoxin toxicity QSAR data sets, as well as on three benchmark data sets with known outlier contamination. It was found that (1) a robust version of MLR should always be preferred over ordinary-least-squares MLR, regardless of the degree of outlier contamination, and that (2) the model's predictive power should only be assessed via robust statistics. The MATLAB and Java source code used in this study is freely available for academic use from the QSAR-BENCH section of www.dmitrykonovalov.org. The Web site also hosts the Java-based QSAR-BENCH program, which can be run online via Java Web Start (supporting Windows, Mac OS X, and Linux/Unix) to reproduce most of the reported results or to apply the reported procedures to other data sets.
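
The comparison described above can be illustrated with a minimal Python sketch; it is not the study's MATLAB/Java code. Several choices here are assumptions made for illustration only: the robust fit is a Huber M-estimator computed by iteratively reweighted least squares (the abstract does not say which robust MLR variant was used), the robust statistic is the median absolute prediction error, and the function names (`fit_ols`, `fit_huber`, `mccv`, `make_demo_data`), the 30% test fraction, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares MLR; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def fit_huber(X, y, k=1.345, n_iter=50, tol=1e-8):
    """Huber M-estimator via iteratively reweighted least squares.

    Assumption: this stands in for "a robust version of MLR"; the
    tuning constant k=1.345 is the conventional default.
    """
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # start from the OLS fit
    for _ in range(n_iter):
        r = y - A @ beta
        # Robust residual scale: normalized median absolute deviation.
        scale = max(1.4826 * np.median(np.abs(r - np.median(r))), 1e-12)
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, k/u for large ones.
        w = np.where(u <= k, 1.0, k / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        new_beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


def predict(beta, X):
    return beta[0] + X @ beta[1:]


def mccv(X, y, fit, n_splits=200, test_frac=0.3, seed=0):
    """Monte Carlo CV: repeated random train/test splits.

    Returns (mean RMSE, mean median-absolute-error) over the splits.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    rmse, med_ae = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        err = y[test] - predict(fit(X[train], y[train]), X[test])
        rmse.append(np.sqrt(np.mean(err ** 2)))   # non-robust statistic
        med_ae.append(np.median(np.abs(err)))     # robust statistic
    return float(np.mean(rmse)), float(np.mean(med_ae))


def make_demo_data(n=100, n_outliers=10, seed=1):
    """Synthetic data (hypothetical): y = 1 + 2*x1 - x2 + noise, plus outliers."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
    y[:n_outliers] += 15.0  # contaminate the first few responses
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    for name, fit in [("OLS", fit_ols), ("Huber", fit_huber)]:
        rmse, med_ae = mccv(X, y, fit)
        print(f"{name:5s}  RMSE={rmse:.3f}  median|err|={med_ae:.3f}")
```

On contaminated data of this kind, the RMSE of both fits is expected to be dominated by the outlying test points, whereas the median absolute error reflects how well each model predicts the bulk of the compounds; this is the sense in which a robust statistic is needed to see the benefit of a robust fit.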
