Variable selection in random calibration of near‐infrared instruments: ridge regression and partial least squares regression settings

Standard methods for calibration of near‐infrared instruments, such as partial least‐squares (PLS) and ridge regression (RR), typically use the full set of wavelengths in the model. In this paper we investigate how variable (wavelength) selection affects model prediction for these two methods. For RR the selection is optimized with respect to the ridge parameter, the number of variables and the configuration of the variables in the model; a fast iterative computational algorithm is developed for this optimization. For PLS the selection is optimized with respect to the number of components, the number of variables and the configuration of the variables. We use three real data sets in this study: processed milk from the market, milk from a dairy farm and milk from the production line of a milk processing factory. The quantity of interest is the concentration of fat in the milk. The observations are randomly split into estimation and validation sets, and optimization is based on the mean square prediction error computed on the validation set. The results indicate that wavelength selection does not always give better prediction than using all of the available wavelengths. Investigation of the information in the spectra is necessary to determine whether all of the wavelengths are relevant to the objective of the model. Copyright © 2003 John Wiley & Sons, Ltd.
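
To make the RR setting concrete, the following is a minimal sketch of validation-based wavelength selection. It is not the paper's fast iterative algorithm, which the abstract does not describe; a plain greedy forward search is swapped in as a stand-in. The synthetic spectra, the ridge grid `lambdas` and the helper names `ridge_fit`, `mspe` and `forward_select` are all assumptions made for illustration, and the milk data sets are not used.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit on centered data; returns (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def mspe(X_est, y_est, X_val, y_val, cols, lam):
    """Fit on the estimation set using columns `cols`; return validation MSPE."""
    coef, intercept = ridge_fit(X_est[:, cols], y_est, lam)
    resid = y_val - (X_val[:, cols] @ coef + intercept)
    return float(np.mean(resid ** 2))

def forward_select(X_est, y_est, X_val, y_val, lambdas, max_vars):
    """Greedy forward search over variables and ridge parameter (illustrative only)."""
    selected, remaining = [], list(range(X_est.shape[1]))
    best = (np.inf, [], None)  # (MSPE, columns, ridge parameter)
    for _ in range(max_vars):
        step = (np.inf, None, None)
        for j in remaining:
            for lam in lambdas:
                err = mspe(X_est, y_est, X_val, y_val, selected + [j], lam)
                if err < step[0]:
                    step = (err, j, lam)
        err, j, lam = step
        selected.append(j)
        remaining.remove(j)
        if err < best[0]:
            best = (err, list(selected), lam)
    return best

# Synthetic stand-in for spectra: 60 samples, 20 "wavelengths",
# of which only columns 3 and 7 carry signal (an assumption for the demo).
rng = np.random.default_rng(0)
n, p = 60, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.1 * rng.normal(size=n)

# Random split into estimation and validation sets.
perm = rng.permutation(n)
est, val = perm[:40], perm[40:]
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

best_err, best_cols, best_lam = forward_select(
    X[est], y[est], X[val], y[val], lambdas, max_vars=p
)
# Reference: all wavelengths, ridge parameter tuned on the same validation set.
full_err = min(
    mspe(X[est], y[est], X[val], y[val], list(range(p)), lam) for lam in lambdas
)
print("selected columns:", sorted(best_cols), "ridge parameter:", best_lam)
print("validation MSPE, selected vs. all:", best_err, full_err)
```

Because the same validation set drives both the selection and the reported error, the selected-subset MSPE here is optimistic. On data where every wavelength carries signal, the gap to the full model can shrink or vanish, which is consistent with the abstract's finding that selection does not always help.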
