The influence of data preprocessing on the robustness and parsimony of multivariate calibration models

Abstract

Multivariate techniques such as partial least squares and principal component regression have high modelling power, but model complexity increases rapidly when non-relevant sources of variance and non-linearities are included. Proper data preprocessing can eliminate these effects beforehand, which results in more parsimonious models. In terms of the relation between prediction errors and model complexity, the effect of data preprocessing can be explained as a sharper decrease in bias upon inclusion of additional model parameters, which is, however, accompanied by a steeper increase in variance due to estimation errors. This means that the predictive ability does not necessarily improve, but parsimonious models are expected to be more robust. These points are illustrated by an example from near-infrared spectroscopy of heavy oil products, in which multiplicative signal correction, applied to the second derivative spectra, was used for data preprocessing.
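
The preprocessing chain named above (second derivative followed by multiplicative signal correction) can be illustrated with a minimal sketch. It is not the authors' implementation: the Savitzky-Golay derivative, the window length and polynomial order, the use of the mean spectrum as reference, the function names, and the synthetic spectra are all assumptions made for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter


def second_derivative(spectra, window_length=11, polyorder=2):
    """Second derivative along the wavelength axis (rows are samples).

    Assumption: a Savitzky-Golay filter is used; the source does not say
    how the derivative spectra were computed.
    """
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=2, axis=1)


def msc(spectra, reference=None):
    """Multiplicative signal correction.

    Each spectrum x_i is regressed on a reference spectrum,
    x_i ~ a_i + b_i * reference, and corrected as (x_i - a_i) / b_i.
    Assumption: the reference defaults to the mean spectrum.
    """
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra, dtype=float)
    for i, x in enumerate(spectra):
        slope, intercept = np.polyfit(reference, x, deg=1)
        corrected[i] = (x - intercept) / slope
    return corrected, reference


# Synthetic (hypothetical) spectra: one underlying band shape that each
# sample sees with its own additive offset and multiplicative gain.
rng = np.random.default_rng(0)
wavelength = np.linspace(0.0, 1.0, 200)
base = (np.exp(-((wavelength - 0.3) / 0.05) ** 2)
        + 0.5 * np.exp(-((wavelength - 0.7) / 0.08) ** 2))
offsets = rng.uniform(-0.2, 0.2, size=5)
gains = rng.uniform(0.8, 1.2, size=5)
raw = offsets[:, None] + gains[:, None] * base

d2 = second_derivative(raw)        # removes the additive offsets
corrected, reference = msc(d2)     # removes the multiplicative gains
```

In this idealized case the derivative step removes the constant offsets and the correction step removes the gains, so every corrected spectrum collapses onto the reference. With real spectra only the scatter-like part of the variance is removed, which is the sense in which such preprocessing is expected to let a calibration model reach a given bias with fewer components.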