Assessing Model Fit by Cross-Validation

When a QSAR model is fitted, it is important to validate it: to check that its predictions can plausibly be expected to carry over to fresh data not used in the fitting exercise. There are two standard ways of doing this: using a separate hold-out test sample, or using the computationally much more burdensome leave-one-out cross-validation, in which the entire pool of available compounds is used both to fit the model and to assess its validity. We show by theoretical argument and by empirical study of a large QSAR data set that when the available sample size is small (in the dozens or scores rather than the hundreds), holding a portion of it back for testing is wasteful, and that it is much better to use cross-validation, provided that this is done properly.
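To make the leave-one-out idea concrete, the following is a minimal sketch and not the authors' procedure. It assumes an ordinary least-squares model, synthetic data standing in for a small set of compounds, and the common definition q² = 1 − PRESS / SS_total. The names `loo_predictions` and `q_squared` are illustrative helpers, not part of any library.

```python
import numpy as np

def loo_predictions(X, y):
    """Predict each response from a least-squares fit that excludes that case."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])  # prepend an intercept column
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i           # every case except the i-th
        coef, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        preds[i] = X1[i] @ coef            # predict the held-out case
    return preds

def q_squared(y, preds):
    """1 - (sum of squared prediction errors) / (total sum of squares)."""
    press = np.sum((y - preds) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total

# Synthetic stand-in for a small data set: 30 "compounds", 3 descriptors (assumed).
rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=n)

# Cross-validated q^2: every case is predicted by a model that never saw it.
loo_preds = loo_predictions(X, y)
q2 = q_squared(y, loo_preds)

# Resubstitution R^2 for comparison: the same cases are used to fit and to assess.
X1 = np.column_stack([np.ones(n), X])
coef_full, *_ = np.linalg.lstsq(X1, y, rcond=None)
r2 = q_squared(y, X1 @ coef_full)

print(f"resubstitution R^2 = {r2:.3f}, leave-one-out q^2 = {q2:.3f}")
```

All 30 cases contribute to both fitting and assessment here, whereas a hold-out split would remove some of them from the fit entirely. For least squares, q² can never exceed the resubstitution R², so the gap between the two gives a rough indication of how optimistic the in-sample fit is.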
