Robust Wavelength Selection Using Filter-Wrapper Method and Input Scaling on Near Infrared Spectral Data

Extracting relevant wavelengths from large Near Infrared Spectroscopy (NIRS) datasets is a significant challenge in vibrational spectroscopy research. Wavelength selection nonetheless improves chemical interpretability by emphasizing the chemical entities related to the chemical parameters of the samples. Because these datasets are complex, irrelevant wavelengths may still be included in the multivariate calibration, which makes the computation unnecessarily complex and decreases the accuracy and robustness of the model. In multivariate analysis, Partial Least Squares Regression (PLSR) is commonly used to build predictive models from NIR spectral data. However, neither the PLSR method nor common commercial chemometrics software applies a standard wavelength selection procedure to screen out irrelevant wavelengths. This study introduces a new robust wavelength selection procedure, the modified VIP-MCUVE (mod-VIP-MCUVE), which uses a Filter-Wrapper method and an input scaling strategy. The proposed method combines the modified Variable Importance in Projection (VIP) and the modified Monte Carlo Uninformative Variable Elimination (MCUVE) to calculate the scale matrix of the input variables. The modified VIP uses the orthogonal components of Partial Least Squares (PLS) to identify the informative variables in the model by applying the amount of variation in both X and y, {SSX, SSY}, simultaneously. The modified MCUVE uses a robust reliability coefficient and a robust tolerance interval in the selection procedure. To evaluate the performance of the proposed method, the classical VIP, MCUVE, and the autoscaling procedure in classical PLSR were also included in the comparison. Using artificial data with Monte Carlo simulation and NIR spectral data of oil palm (Elaeis guineensis Jacq.) fruit mesocarp, the study shows that the proposed method improves model interpretability, is computationally extensive, and produces better model accuracy.
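
For orientation, the classical VIP score that the abstract uses as a baseline can be sketched in a few lines of Python. This is only a minimal illustration of the textbook VIP for a single-response PLS model fitted with NIPALS; it is not the modified VIP or the mod-VIP-MCUVE procedure proposed in the study, whose details the abstract does not give. The function names pls1_nipals and vip_scores, the synthetic data, and the choice of two components are assumptions made for this example.

```python
import numpy as np


def pls1_nipals(X, y, n_components):
    """Fit a single-response PLS model with NIPALS (hypothetical helper).

    Returns the weight matrix W (p x A), the score matrix T (n x A)
    and the y-loadings q (length A).
    """
    Xk = X - X.mean(axis=0)          # mean-center the predictors
    yk = y - y.mean()                # mean-center the response
    W, T, Q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)       # unit-length weight vector
        t = Xk @ w                   # scores for this component
        tt = t @ t
        p = Xk.T @ t / tt            # X-loadings
        q = yk @ t / tt              # y-loading
        Xk = Xk - np.outer(t, p)     # deflate X
        yk = yk - q * t              # deflate y
        W.append(w)
        T.append(t)
        Q.append(q)
    return np.array(W).T, np.array(T).T, np.array(Q)


def vip_scores(W, T, q):
    """Classical VIP: weights squared, averaged by explained SSY per component."""
    n_vars = W.shape[0]
    ssy = q ** 2 * np.sum(T ** 2, axis=0)            # y-variation explained per component
    w_sq = (W / np.linalg.norm(W, axis=0)) ** 2      # normalized squared weights
    return np.sqrt(n_vars * (w_sq @ ssy) / ssy.sum())


# Synthetic illustration (assumed data, not from the study):
# only variables 3 and 7 carry information about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 3] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=200)

W, T, q = pls1_nipals(X, y, n_components=2)
vip = vip_scores(W, T, q)
informative = np.argsort(vip)[-2:]   # indices of the two largest VIP scores
```

By construction the squared VIP scores sum to the number of variables, so a score above 1 is the usual rule of thumb for flagging a variable as informative. A filter step in this style ranks wavelengths only; the wrapper and input-scaling stages described in the abstract are not covered here.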
