How to avoid over-fitting in multivariate calibration: the conventional validation approach and an alternative.

This paper critically reviews the problem of over-fitting in multivariate calibration and the conventional validation-based approach to avoid it. It proposes a randomization test that enables one to assess the statistical significance of each component that enters the model. This alternative is compared with cross-validation and independent test set validation for the calibration of a near-infrared spectral data set using partial least squares (PLS) regression. The results indicate that the alternative approach is more objective, since, unlike the validation-based approach, it does not require the use of 'soft' decision rules. The alternative approach therefore appears to be a useful addition to the chemometrician's toolbox.
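The idea of testing each component for significance can be illustrated with a minimal sketch. It is not the paper's exact procedure, and it rests on several assumptions: a single response (PLS1), a test statistic equal to the absolute inner product between a component's score vector and the current response residual, and a null distribution obtained by randomly permuting that residual. The function names, the choice of statistic, and the defaults (`max_components`, `n_permutations`, `seed`) are all hypothetical.

```python
import numpy as np


def pls1_weight_and_score(X, y):
    """One PLS1 step: weight w proportional to X'y, score t = X w."""
    w = X.T @ y
    norm = np.linalg.norm(w)
    if norm > 0:
        w = w / norm
    return w, X @ w


def component_p_values(X, y, max_components=5, n_permutations=200, seed=0):
    """Permutation p-value for each successive PLS1 component.

    Assumed statistic (illustrative choice): |t'y| for the component
    extracted from the current residuals of X and y.
    """
    rng = np.random.default_rng(seed)
    Xr = X - X.mean(axis=0)   # column-centred predictors
    yr = y - y.mean()         # centred response
    p_values = []
    for _a in range(max_components):
        _, t = pls1_weight_and_score(Xr, yr)
        observed = abs(t @ yr)

        # Null distribution: break the X-y link by permuting the response.
        exceed = 0
        for _ in range(n_permutations):
            y_perm = rng.permutation(yr)
            _, t_perm = pls1_weight_and_score(Xr, y_perm)
            if abs(t_perm @ y_perm) >= observed:
                exceed += 1
        # Add-one correction keeps the p-value strictly positive.
        p_values.append((exceed + 1) / (n_permutations + 1))

        # Deflate X and y by the real (unpermuted) component.
        tt = t @ t
        if tt == 0:
            break
        loading = Xr.T @ t / tt
        q = (yr @ t) / tt
        Xr = Xr - np.outer(t, loading)
        yr = yr - q * t
    return p_values
```

Under these assumptions, one would retain components up to the last one whose p-value falls below a chosen significance level. The threshold is stated in advance, rather than read off a validation curve.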
