
Assessing Model Fit by Cross-Validation

When QSAR models are fitted, it is important to validate the fitted model, that is, to check that its predictions can plausibly be expected to carry over to fresh data not used in the model fitting. There are two standard ways of doing this: using a separate hold-out test sample, and the computationally much more burdensome leave-one-out cross-validation, in which the entire pool of available compounds is used both to fit the model and to assess its validity. We show by theoretical argument and an empirical study of a large QSAR data set that when the available sample size is small (in the dozens or scores rather than the hundreds), holding a portion of it back for testing is wasteful, and that it is much better to use cross-validation, provided that it is done properly.
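To make the two validation schemes concrete, here is a minimal sketch in Python with NumPy. It is not the authors' procedure or data: the data are synthetic, the model is plain ordinary least squares, and the function names (`fit_ols`, `loo_q2`, `holdout_r2`) are hypothetical names chosen for illustration. It computes the leave-one-out statistic q² = 1 − PRESS / SS_total, which uses every compound for both fitting and assessment, alongside a hold-out R² that sacrifices part of a small sample for testing.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Predictions from coefficients returned by fit_ols."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def loo_q2(X, y):
    """Leave-one-out cross-validated q^2 = 1 - PRESS / SS_total."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                # drop compound i
        coef = fit_ols(X[keep], y[keep])        # refit without it
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def holdout_r2(X, y, n_test, rng):
    """R^2 on a randomly chosen hold-out set of n_test compounds."""
    idx = rng.permutation(len(y))
    test, train = idx[:n_test], idx[n_test:]
    coef = fit_ols(X[train], y[train])
    resid = y[test] - predict(coef, X[test])
    ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / ss_tot

# Synthetic stand-in for a small QSAR data set (assumption: 40 compounds,
# 3 descriptors, linear signal plus Gaussian noise).
rng = np.random.default_rng(0)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

q2 = loo_q2(X, y)
r2_holdout = holdout_r2(X, y, n_test=10, rng=rng)
print(f"LOO q2 = {q2:.3f}, hold-out R2 = {r2_holdout:.3f}")
```

With only 10 test compounds, the hold-out figure depends heavily on which compounds happen to be held back, whereas the leave-one-out figure draws on all 40. The sketch has no descriptor-selection step; if a real workflow included one, it would presumably need to be repeated inside each leave-one-out refit for the q² to remain an honest estimate.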

Douglas M. Hawkins | Subhash C. Basak | Denise R. Mills
