Reducing over-optimism in variable selection by cross-model validation

Abstract Extensive optimisation of a mathematical model's fit to a relatively small set of empirical data may lead to over-optimistic validation results. If the final, optimised model is assessed with the same validation method and the same input data that served as the basis for the extensive model optimisation, accumulated spurious correlations may appear as real predictive ability in the final model validation. An example of this is extensive variable selection in multiple regression based on a cross-validation scheme. To illustrate the over-optimism problem in optimisation based on conventional one-layered validation, an artificial data set containing only random numbers was submitted to regression modelling, and the model was optimised by stepwise variable selection. After the number of X-variables had been reduced stepwise from 500 to 29, leave-one-out cross-validation indicated a very good apparent predictive ability for y from X in the final model (84%). Finally, the performance of cross-model validation is tested on one large QSAR data set. Several calibration sets were chosen randomly, and for each a regression model was optimised by variable selection. The prediction accuracy of these models was compared to the cross-validation and cross-model validation results. In these tests, cross-model validation gives the better measure of model predictive ability.
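The random-number experiment described in the abstract can be imitated on a small scale. The sketch below is not the authors' procedure. It assumes that cross-model validation means repeating the variable selection inside every leave-one-out segment, and it swaps in a simple correlation-ranking filter and a one-parameter score regression for the stepwise selection and regression method of the paper. All names and settings (`select`, `fit_predict`, `compare`, `n=40`, `p=500`, `k=10`) are hypothetical, chosen only for illustration.

```python
import random


def mean(v):
    return sum(v) / len(v)


def corr(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5 if saa > 0 and sbb > 0 else 0.0


def select(X, y, rows, k):
    """Return (column, sign) for the k columns most correlated with y on `rows`."""
    ys = [y[i] for i in rows]
    r = [(corr([X[i][j] for i in rows], ys), j) for j in range(len(X[0]))]
    top = sorted(r, key=lambda t: -abs(t[0]))[:k]
    return [(j, 1.0 if c >= 0 else -1.0) for c, j in top]


def score(X, i, chosen):
    """Signed sum of the selected variables for sample i."""
    return sum(sign * X[i][j] for j, sign in chosen)


def fit_predict(X, y, train, test, chosen):
    """Fit y = a + b * score on `train` by least squares, then predict sample `test`."""
    s = [score(X, i, chosen) for i in train]
    t = [y[i] for i in train]
    ms, mt = mean(s), mean(t)
    sxx = sum((u - ms) ** 2 for u in s)
    b = sum((u - ms) * (v - mt) for u, v in zip(s, t)) / sxx if sxx > 0 else 0.0
    return mt + b * (score(X, test, chosen) - ms)


def q2(y, preds):
    """Cross-validated explained variance: 1 - PRESS / total sum of squares."""
    my = mean(y)
    press = sum((a - p) ** 2 for a, p in zip(y, preds))
    return 1.0 - press / sum((a - my) ** 2 for a in y)


def compare(n=40, p=500, k=10, seed=0):
    rng = random.Random(seed)
    # Pure noise: y has no real relationship to any column of X.
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    rows = list(range(n))

    # One-layered validation: variables are selected once on ALL samples,
    # then only the regression is refitted in each leave-one-out segment.
    chosen = select(X, y, rows, k)
    naive = [fit_predict(X, y, [r for r in rows if r != i], i, chosen) for i in rows]

    # Two-layered validation (assumed reading of cross-model validation):
    # the held-out sample takes no part in the variable selection either.
    nested = []
    for i in rows:
        train = [r for r in rows if r != i]
        nested.append(fit_predict(X, y, train, i, select(X, y, train, k)))

    return q2(y, naive), q2(y, nested)


q2_naive, q2_cmv = compare()
print(f"one-layered LOO Q2: {q2_naive:.2f}")
print(f"two-layered    Q2: {q2_cmv:.2f}")
```

Because the data are noise, any apparent predictive ability in the one-layered figure comes from selection bias alone. The two-layered figure should stay well below it, since the held-out sample never influences which variables are kept.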
