Model selection for partial least squares regression

Partial least squares (PLS) regression is a powerful and frequently applied technique in multivariate statistical process control when the process variables are highly correlated. Selecting the number of latent variables needed to build a representative model is an important issue. A metric frequently used by chemometricians to determine the number of latent variables is Wold's R criterion, whilst more recently a number of statisticians have advocated the use of the Akaike Information Criterion (AIC). In this paper, Wold's R criterion and AIC are compared, by means of a simulation study, for selecting the number of latent variables to include in a PLS model that will form the basis of a multivariate statistical process control representation. It is shown that neither Wold's R criterion nor AIC exhibits satisfactory performance. This is in contrast to the adjusted Wold's R criterion, which is shown to demonstrate satisfactory performance in terms of the number of times the known true model is selected. Two industrial applications are then used to demonstrate the methodology. The first relates to the modelling of product quality using data from an industrial fluidised bed reactor, and the second focuses on an industrial near-infrared (NIR) data set. The results are consistent with those of the simulation study.
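As a rough illustration of how a criterion of this kind operates, the sketch below applies a Wold's-R-style stopping rule to a sequence of cross-validated prediction error sums of squares (PRESS). It assumes the usual formulation, in which the ratio R = PRESS(a+1) / PRESS(a) is compared with a threshold: a threshold of 1 corresponds to the unadjusted rule, and a stricter value such as 0.90 or 0.95 is assumed here to stand in for the adjusted rule. The abstract states neither the formula nor the thresholds, so both are assumptions, and the function name `select_latent_variables` and the PRESS values are hypothetical.

```python
from typing import Sequence


def select_latent_variables(press: Sequence[float], threshold: float = 1.0) -> int:
    """Pick the number of latent variables with a Wold's-R-style rule.

    press[a - 1] is assumed to hold the cross-validated PRESS of a model
    with `a` latent variables. Components are added one at a time, and the
    search stops at the first `a` for which PRESS(a + 1) / PRESS(a) exceeds
    `threshold`, i.e. when the next component no longer improves prediction
    by the required margin.
    """
    if not press:
        raise ValueError("press must contain at least one value")
    for a in range(1, len(press)):
        ratio = press[a] / press[a - 1]  # R for going from a to a + 1 components
        if ratio > threshold:
            return a
    # Every additional component passed the test, so keep them all.
    return len(press)


# Hypothetical PRESS values for models with 1 to 5 latent variables.
press_values = [100.0, 60.0, 57.0, 58.0, 59.0]

unadjusted = select_latent_variables(press_values, threshold=1.0)
adjusted = select_latent_variables(press_values, threshold=0.90)
print(unadjusted, adjusted)
```

With these illustrative numbers the stricter threshold stops one component earlier, because the third latent variable reduces PRESS by only about 5%. This shows the mechanism only and says nothing about which rule performs better in practice.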
