A new strategy of least absolute shrinkage and selection operator coupled with sampling error profile analysis for wavelength selection

Abstract A new strategy, SEPA-LASSO, which combines sampling error profile analysis (SEPA) with the least absolute shrinkage and selection operator (LASSO), is proposed. LASSO has been proven effective for multivariate calibration of high-dimensional data because it selects variables automatically. In previous research, however, the critical step of multivariate calibration by LASSO was the optimization of the 1-norm tuning parameter for a fixed sample set, without considering how variable selection behaves across different subsets of samples. In the present work, Monte Carlo sampling (MCS), the core of the SEPA framework, is used to investigate a variety of sub-models. Least angle regression (LAR) is used to solve LASSO, so that LAR iterations containing a given number of variables can be obtained directly instead of choosing numerical values of the 1-norm tuning parameter. The SEPA-LASSO algorithm consists of a large number of loops. In each loop, under the SEPA framework and the LAR algorithm, a number of LASSO sub-models with the same dimension are built by MCS, and a vote rule is used to determine the importance of the variables and to select them into a variable subset. After all loops have run, several variable subsets are obtained, and their error profiles are used to choose the optimal subset. The performance of SEPA-LASSO was evaluated on three near-infrared (NIR) datasets. The results show that the model built by SEPA-LASSO has excellent predictability and interpretability compared with commonly used multivariate calibration methods, such as principal component regression (PCR) and partial least squares (PLS), and with wavelength selection methods including LASSO, moving window partial least squares regression (MWPLSR), Monte Carlo uninformative variable elimination (MC-UVE), ordered homogeneity pursuit lasso (OHPL) and stability competitive adaptive reweighted sampling (SCARS).
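The Monte Carlo sampling and vote rule of a single loop can be illustrated with a minimal sketch. It is not the authors' implementation: the names `vote_select`, `fit_submodel`, `sample_ratio` and `n_keep` are hypothetical, the sampling ratio and number of sub-models are arbitrary, and `toy_fit` is only a stand-in for a LAR run stopped at a fixed number of variables. The sketch assumes that a variable receives one vote each time a sub-model selects it and that the most-voted variables form the subset for that loop.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def vote_select(
    n_samples: int,
    fit_submodel: Callable[[Sequence[int]], Sequence[int]],
    n_submodels: int = 100,
    sample_ratio: float = 0.8,
    n_keep: int = 5,
    seed: int = 0,
) -> List[int]:
    """One loop: build sub-models on random sample subsets, keep the most-voted variables."""
    rng = random.Random(seed)
    subset_size = max(1, int(round(sample_ratio * n_samples)))
    votes: Counter = Counter()
    for _ in range(n_submodels):
        # Monte Carlo sampling: draw a random subset of sample indices without replacement.
        rows = rng.sample(range(n_samples), subset_size)
        # Each sub-model casts one vote per variable it selected.
        votes.update(set(fit_submodel(rows)))
    # Rank by vote count (descending); ties are broken by variable index.
    ranked = sorted(votes, key=lambda v: (-votes[v], v))
    return sorted(ranked[:n_keep])


def toy_fit(rows: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for a fixed-size LAR sub-model.

    Always selects variables 2 and 7; the third selection depends on the subset.
    """
    return [2, 7, 10 + (sum(rows) % 3)]


if __name__ == "__main__":
    print(vote_select(20, toy_fit, n_submodels=50, n_keep=2))
```

In a real calibration, `fit_submodel` would fit LASSO by LAR on the selected rows of the spectral matrix and return the indices of the variables present at the chosen iteration. Repeating `vote_select` over several loops would yield the candidate subsets whose error profiles are then compared.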
