Kernel Partial Least Square Regression with High Resistance to Multiple Outliers and Bad Leverage Points on Near-Infrared Spectral Data Analysis

Multivariate statistical methods such as partial least squares regression (PLSR) are common data processing techniques for handling the high-dimensional data space of near-infrared (NIR) spectral datasets. PLSR is useful for tackling the multicollinearity and heteroscedasticity problems frequently found in such data. When the original input space has a nonlinear structure, however, the classical PLSR model may not be appropriate. In addition, contamination by multiple outliers and high leverage points (HLPs) can further damage the model. HLPs comprise both good leverage points (GLPs) and bad leverage points (BLPs); removing the BLPs is therefore relevant, since they have a significant impact on the parameter estimates and can slow the convergence process. The GLPs, on the other hand, improve the efficiency of model calibration and should not be eliminated. In this study, robust alternatives to the existing kernel partial least squares (KPLS) regression, called the kernel partial robust GM6-estimator (KPRGM6) regression and the kernel partial robust modified GM6-estimator (KPRMGM6) regression, are introduced. The nonlinearity in PLSR is handled through kernel-based learning, in which the original input data matrix is nonlinearly mapped into a high-dimensional feature space corresponding to a reproducing kernel Hilbert space (RKHS). To increase robustness, improved GM6 estimators are combined with the nonlinear PLSR. Based on investigations using several artificial data scenarios from Monte Carlo simulations and two real near-infrared (NIR) spectral datasets, the proposed robust KPRMGM6 is found to be superior to the robust KPRGM6 and the non-robust KPLS.
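To make the kernel-PLS-plus-robust-weighting idea concrete, the following Python sketch computes NIPALS-style kernel PLS latent scores on a centered RBF kernel and then fits the response on those scores with a simple Huber-type iteratively reweighted least squares loop. This is only a minimal illustration of the general approach (nonlinear feature mapping via a kernel, followed by downweighting of outlying observations); the function names, the RBF kernel choice, the Huber tuning constant, and the synthetic data are assumptions for the example and do not reproduce the authors' KPRGM6/KPRMGM6 estimators.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X1 and X2."""
    d2 = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * d2)

def center_kernel(K):
    """Center a square training kernel in the feature space (RKHS)."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

def kpls_scores(K, Y, n_components=3, max_iter=500, tol=1e-8):
    """NIPALS-style kernel PLS; returns the latent score matrix T (n x A)."""
    Kd, Yd = K.copy(), Y.copy()
    n = K.shape[0]
    T = np.zeros((n, n_components))
    for a in range(n_components):
        u = Yd[:, [0]].copy()
        for _ in range(max_iter):
            t = Kd @ u
            t /= np.linalg.norm(t)
            c = Yd.T @ t
            u_new = Yd @ c
            u_new /= np.linalg.norm(u_new)
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        T[:, [a]] = t
        P = np.eye(n) - t @ t.T      # deflate kernel and response w.r.t. t
        Kd = P @ Kd @ P
        Yd = P @ Yd
    return T

def huber_weights(res, c=1.345):
    """Huber weights on MAD-scaled residuals (downweight large residuals)."""
    s = 1.4826 * np.median(np.abs(res - np.median(res))) + 1e-12
    r = np.abs(res) / s
    return np.where(r <= c, 1.0, c / r)

def robust_kpls_fit(X, y, n_components=3, gamma=1.0, n_irls=20):
    """Fit y on kernel-PLS scores with a Huber-type IRLS reweighting loop."""
    y = np.asarray(y, dtype=float).ravel()
    K = center_kernel(rbf_kernel(X, X, gamma))
    T = kpls_scores(K, y.reshape(-1, 1), n_components)
    Ta = np.column_stack([np.ones(len(y)), T])      # scores plus intercept
    beta = np.linalg.lstsq(Ta, y, rcond=None)[0]
    for _ in range(n_irls):
        w = np.sqrt(huber_weights(y - Ta @ beta))[:, None]
        beta = np.linalg.lstsq(Ta * w, y * w.ravel(), rcond=None)[0]
    return beta, T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 50))                   # stand-in for NIR spectra
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
    y[:5] += 10.0                                   # inject vertical outliers
    beta, T = robust_kpls_fit(X, y, n_components=4, gamma=0.02)
    print("fitted coefficients on latent scores:", np.round(beta, 3))
```

In this sketch the robustness enters only through residual-based weights in the latent-score regression; a GM-type estimator such as the one studied here would additionally downweight observations with high leverage in the score space, which is what distinguishes it from a plain M-type reweighting.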
