Interpolation and extrapolation problems of multivariate regression in analytical chemistry: benchmarking the robustness on near-infrared (NIR) spectroscopy data.

Modern analytical chemistry of industrial products needs rapid, robust, and inexpensive methods for continuous monitoring of product quality parameters. For this reason, spectroscopic methods are often used for on-line/in-line quality control of industrial products. Vibrational spectroscopy, including mid-infrared (MIR), Raman, and near-infrared (NIR) spectroscopy, is one of the best ways to obtain information about the chemical structure and quality parameters of multicomponent mixtures. Combined with chemometric algorithms and multivariate data analysis (MDA) methods, which were developed specifically for complicated, noisy, and overlapping signals, NIR spectroscopy achieves high accuracy as measured by the classical prediction error, the root-mean-square error of prediction (RMSEP). However, it is unclear whether the combined NIR + MDA methods can handle the much more complex interpolation and extrapolation problems that are inevitably present in real-world applications. In the current study, we make a general comparison of the robustness of three classes of regression methods: linear, such as partial least squares or projection to latent structures (PLS); "quasi-nonlinear", such as the polynomial version of PLS (Poly-PLS); and intrinsically nonlinear, such as artificial neural networks (ANNs), support vector regression (SVR), and least-squares support vector machines (LS-SVM/LSSVM). As a measure of robustness, we estimate their accuracy when solving interpolation and extrapolation problems. Petroleum and biofuel (biodiesel) systems were chosen as representative examples of real-world samples. Six chemical systems that differ in complexity, composition, structure, and properties were studied: gasoline, ethanol-gasoline biofuel, diesel fuel, aromatic solutions of petroleum macromolecules, petroleum resins in benzene, and biodiesel. Eighteen different sample sets were used in total.
General conclusions are made about the applicability of ANN- and SVM-based regression tools in modern analytical chemistry. The relative effectiveness of the multivariate algorithms changes when the criterion shifts from classical accuracy to robustness. Neural networks, which can produce very accurate results with respect to classical RMSEP, are not able to solve interpolation problems or, especially, extrapolation problems. The chemometric methods based on the support vector machine (SVM) approach are capable of solving both classical regression and interpolation/extrapolation tasks.
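
The robustness measure described above, prediction accuracy on samples whose property values lie in a gap inside the calibration range (interpolation) or beyond it (extrapolation), can be illustrated with a minimal sketch. This is not the protocol used in the study: the split rule in `split_by_property`, the synthetic single-component "spectra", and the ordinary least-squares model standing in for PLS, ANN, or SVM regression are all assumptions made for illustration only.

```python
import numpy as np


def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction over a test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def split_by_property(y, mode, frac=0.2):
    """Pick test samples by property value (hypothetical split rule).

    'extrapolation': the test set is the top `frac` of y values,
                     so it lies outside the calibration range.
    'interpolation': the test set is the middle `frac` of y values,
                     leaving a gap inside the calibration range.
    Returns (train_idx, test_idx).
    """
    y = np.asarray(y, dtype=float)
    order = np.argsort(y)
    n = len(y)
    n_test = max(1, int(round(frac * n)))
    if mode == "extrapolation":
        test = order[n - n_test:]
    elif mode == "interpolation":
        start = (n - n_test) // 2
        test = order[start:start + n_test]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    train = np.setdiff1d(order, test)
    return train, test


def fit_linear(X, y):
    """Ordinary least squares with intercept (a stand-in, not PLS)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    A = np.column_stack([X, np.ones(len(X))])
    return A @ coef


def run_demo(seed=0):
    """Compare RMSEP under both split types on synthetic data."""
    rng = np.random.default_rng(seed)
    y = rng.uniform(0.0, 10.0, size=60)      # property of interest
    profile = np.linspace(0.5, 1.5, 20)      # made-up band shape
    X = np.outer(y, profile) + 0.01 * rng.standard_normal((60, 20))
    results = {}
    for mode in ("interpolation", "extrapolation"):
        train, test = split_by_property(y, mode)
        coef = fit_linear(X[train], y[train])
        results[mode] = rmsep(y[test], predict_linear(coef, X[test]))
    return results


if __name__ == "__main__":
    for mode, err in run_demo().items():
        print(f"{mode}: RMSEP = {err:.4f}")
```

Replacing `fit_linear` and `predict_linear` with another regression method while keeping the same splits gives a like-for-like comparison of classical accuracy against interpolation and extrapolation accuracy.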
