The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration

Classical least squares (CLS) and partial least squares (PLS) are two common multivariate regression algorithms in chemometrics. This paper presents an asymptotically exact mathematical analysis of the mean squared error of prediction of CLS and PLS under the linear mixture model commonly assumed in spectroscopy. For CLS regression with a very large calibration set, the root mean squared error is approximately equal to the noise per wavelength divided by the length of the net analyte signal vector. It is shown, however, that for a finite training set with n samples in p dimensions there are additional error terms that depend on σ²p²/n², where σ is the noise level per coordinate. In the 'large p, small n' regime common in spectroscopy, these terms can therefore be quite large and can even dominate the overall prediction error. It is demonstrated both theoretically and by simulations that dimensional reduction of the input data, via a compact representation with a few features selected for example by adaptive wavelet compression, can substantially decrease these effects and recover the asymptotic error. This analysis provides a theoretical justification for performing feature selection (dimensional reduction) on the input data before applying multivariate regression algorithms. Copyright © 2005 John Wiley & Sons, Ltd.
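
The finite-sample effect described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's code. It assumes two Gaussian-shaped pure spectra, uniformly distributed concentrations and white Gaussian noise, and it uses plain block averaging of adjacent wavelengths as a simple stand-in for the adaptive wavelet compression mentioned above. All function names and parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def make_pure_spectra(p=500, centers=(200, 300), width=40.0):
    """Assumed pure-component spectra: one Gaussian peak per component."""
    grid = np.arange(p)
    return np.array([np.exp(-(grid - c) ** 2 / (2 * width ** 2)) for c in centers])

def net_analyte_signal_norm(V, k=0):
    """Length of the part of spectrum k orthogonal to the other pure spectra."""
    others = np.delete(V, k, axis=0)
    coef, *_ = np.linalg.lstsq(others.T, V[k], rcond=None)
    return float(np.linalg.norm(V[k] - others.T @ coef))

def bin_features(X, b):
    """Orthonormal block averaging: p channels -> p // b features (p % b == 0)."""
    n, p = X.shape
    return X.reshape(n, p // b, b).sum(axis=2) / np.sqrt(b)

def cls_rmse(V, sigma, n, bin_width=1, n_test=500, n_rep=10, seed=0):
    """Empirical RMSE of CLS for component 0, averaged over calibration sets."""
    rng = np.random.default_rng(seed)
    m, p = V.shape
    sq_err = []
    for _ in range(n_rep):
        # Linear mixture model: spectrum = concentrations @ pure spectra + noise
        C = rng.uniform(0, 1, (n, m))
        X = C @ V + sigma * rng.standard_normal((n, p))
        C_test = rng.uniform(0, 1, (n_test, m))
        X_test = C_test @ V + sigma * rng.standard_normal((n_test, p))
        X, X_test = bin_features(X, bin_width), bin_features(X_test, bin_width)
        # CLS step 1: estimate pure spectra from the calibration set
        V_hat, *_ = np.linalg.lstsq(C, X, rcond=None)
        # CLS step 2: regress each test spectrum on the estimated pure spectra
        C_pred, *_ = np.linalg.lstsq(V_hat.T, X_test.T, rcond=None)
        sq_err.append((C_pred[0] - C_test[:, 0]) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

V = make_pure_spectra()
sigma = 0.2
print("asymptotic sigma/|NAS|  :", sigma / net_analyte_signal_norm(V))
print("CLS, n=5000, p=500      :", cls_rmse(V, sigma, n=5000))
print("CLS, n=10,   p=500      :", cls_rmse(V, sigma, n=10))
print("CLS, n=10,   25 features:", cls_rmse(V, sigma, n=10, bin_width=20))
```

Under these assumptions, the large calibration set should give an error close to σ divided by the net analyte signal length, the small calibration set a noticeably larger one, and the reduced representation should bring the small-n error back toward the asymptotic value. Block averaging works here only because the assumed spectra are smooth; it is not a substitute for the adaptive feature selection analysed in the paper.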
