Investigating the need for preprocessing of near-infrared spectroscopic data as a function of sample size

Abstract: Preprocessing of near-infrared (NIR) spectra is an essential part of multivariate calibration. Its main aim is to remove artefacts introduced during measurement, in order to improve prediction performance or interpretability. However, preprocessing can have undesired side effects. In addition, calibration algorithms can learn to deal with artefacts by themselves when enough samples are available, so the effect of preprocessing on prediction performance may change as the calibration dataset grows. In this paper we investigate the interaction between calibration set size and preprocessing in NIR calibrations for several datasets. The results show that extending the calibration data with more samples improves prediction performance, regardless of the preprocessing strategy. Although prediction performance almost always benefits from preprocessing, extending the calibration data can reduce that benefit. The optimal preprocessing strategy may therefore change as a function of the number of samples. We demonstrate that using a Design of Experiments (DoE) approach to determine the optimal preprocessing strategy leads to equal or better prediction performance than no preprocessing at all, for all calibration set sizes. Preprocessing is most valuable for small calibration sets, but as the calibration set grows it can become obsolete or even harmful. We therefore recommend always evaluating the effect of a preprocessing strategy before building or updating calibration models.
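
The recommendation to evaluate a preprocessing strategy at each calibration set size can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not the method of the paper: the spectra are synthetic (a hypothetical `simulate` function), standard normal variate (SNV) scaling is used as one example of a preprocessing step, a closed-form ridge regression stands in for whatever calibration algorithm is actually used, and the comparison is a plain with/without check, not a DoE screening. It reports the root mean square error of prediction (RMSEP) on a fixed test set for raw and preprocessed spectra as the calibration set grows.

```python
import numpy as np


def snv(X):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)


def simulate(n, rng, n_wavelengths=50):
    """Hypothetical toy spectra: one Gaussian analyte band plus scatter artefacts."""
    grid = np.linspace(0.0, 1.0, n_wavelengths)
    band = np.exp(-((grid - 0.5) ** 2) / 0.01)
    y = rng.uniform(0.0, 1.0, n)                # analyte concentration
    scale = rng.uniform(0.8, 1.2, (n, 1))       # multiplicative scatter
    offset = rng.uniform(-0.2, 0.2, (n, 1))     # additive baseline shift
    noise = rng.normal(0.0, 0.01, (n, n_wavelengths))
    X = scale * (1.0 + np.outer(y, band)) + offset + noise
    return X, y


def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression on mean-centred data (stand-in calibration model)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef


def rmsep(X_cal, y_cal, X_test, y_test):
    """Fit on the calibration set and return the prediction error on the test set."""
    coef, intercept = fit_ridge(X_cal, y_cal)
    pred = X_test @ coef + intercept
    return float(np.sqrt(np.mean((pred - y_test) ** 2)))


def compare(sizes=(10, 40, 160), seed=0):
    """RMSEP with and without SNV for nested calibration sets of increasing size."""
    rng = np.random.default_rng(seed)
    X_test, y_test = simulate(200, rng)
    X_pool, y_pool = simulate(max(sizes), rng)
    results = {}
    for n in sizes:
        X_cal, y_cal = X_pool[:n], y_pool[:n]
        results[n] = {
            "raw": rmsep(X_cal, y_cal, X_test, y_test),
            "snv": rmsep(snv(X_cal), y_cal, snv(X_test), y_test),
        }
    return results


if __name__ == "__main__":
    for n, r in compare().items():
        print(f"n={n:4d}  raw RMSEP={r['raw']:.4f}  SNV RMSEP={r['snv']:.4f}")
```

Whether SNV helps or hurts in this toy setting depends on the simulated artefacts and the seed; the point of the sketch is only the evaluation loop, which is rerun whenever the calibration set is extended.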
