SS-DAC: A systematic framework for selecting the best modeling approach and pre-processing for spectroscopic data

Abstract

Selecting the best combination of pre-processing and modeling methodologies is a critical step when building soft sensors from spectroscopic Process Analytical Technology (PAT) data. To help practitioners with this task, a framework for Soft Sensor Development, Assessment and Comparison (SS-DAC) is proposed. Applying SS-DAC achieves three goals: the models’ parameters are tuned for optimal performance; the models’ predictive performances are statistically compared against each other; and the statistical analyses are summarized through graphical displays. SS-DAC is designed to accommodate different pre-processing and modeling methodologies, and it can be applied to any type of linear or nonlinear model. To exemplify its application, 13 pre-processing and three modeling methodologies were compared on three real case studies covering a total of 13 response properties. In all examples, SS-DAC provided useful insights into the best subset of models and highlighted trends in the impact of the different pre-processing and modeling methodologies.
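The statistical comparison step can be illustrated with a minimal sketch. The abstract does not state which tests SS-DAC uses, so everything below is an assumption made for illustration: each candidate (a pre-processing and model combination) is scored by RMSE on the same set of resampling splits, pairs of candidates are compared with an exact two-sided sign test, and the resulting p-values are adjusted with Holm's step-down procedure. The function names `sign_test_p`, `holm`, and `compare_models` are hypothetical.

```python
from itertools import combinations
from math import comb


def sign_test_p(a, b):
    """Two-sided exact sign test on paired errors a[i], b[i] (ties dropped)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0  # no informative pairs: no evidence of a difference
    k = sum(d > 0 for d in diffs)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted


def compare_models(errors):
    """errors: dict mapping candidate name -> per-split RMSE list (same splits)."""
    pairs = list(combinations(sorted(errors), 2))
    raw = [sign_test_p(errors[a], errors[b]) for a, b in pairs]
    return dict(zip(pairs, holm(raw)))
```

Calling `compare_models` with a dictionary of per-split errors returns one adjusted p-value per pair of candidates. Pairs whose adjusted p-value falls below a chosen significance level would be treated as performing differently, and the remaining pairs as statistically indistinguishable.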
