Iteratively variable subset optimization for multivariate calibration

This study proposes iteratively variable subset optimization (IVSO), a novel variable selection strategy based on the theory that a large partial least squares (PLS) regression coefficient on autoscaled data indicates an important variable. The optimal number of latent variables chosen by cross-validation strongly affects the regression coefficients, sometimes changing them by several orders of magnitude, so the coefficients generated in every sub-model are normalized to remove this influence. In each iterative round, the regression coefficients that each variable obtains from the sub-models are summed to evaluate its importance. A two-step procedure, consisting of weighted binary matrix sampling (WBMS) and sequential addition, eliminates uninformative variables gradually and gently in a competitive way and reduces the risk of losing important variables. IVSO can thus achieve high stability. Tested on one simulated dataset and two near-infrared (NIR) datasets, IVSO shows much better prediction ability than two other outstanding and commonly used methods: Monte Carlo uninformative variable elimination (MC-UVE) and competitive adaptive reweighted sampling (CARS). The MATLAB code for implementing IVSO is available in the electronic supplementary information (ESI).
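One iterative round of the kind described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation. The function names (`wbms`, `one_lv_pls_coefs`, `ivso_round`), the starting weight of 0.5, and the choice to normalize by the largest absolute coefficient in each sub-model are all assumptions made for the example. To stay self-contained, a one-latent-variable PLS model stands in for the cross-validated PLS fit (its coefficients on autoscaled data are proportional to X^T y), and the sequential addition step is omitted.

```python
import random

def wbms(weights, n_models, rng):
    """Weighted binary matrix sampling (sketch): column j holds
    round(n_models * w_j) ones, shuffled; each row is one sub-model mask."""
    columns = []
    for w in weights:
        n_ones = round(n_models * w)
        col = [1] * n_ones + [0] * (n_models - n_ones)
        rng.shuffle(col)
        columns.append(col)
    return [list(row) for row in zip(*columns)]

def autoscale(col):
    """Center a variable and scale it to unit standard deviation."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in col]

def one_lv_pls_coefs(X_cols, y):
    """Assumed stand-in for the PLS fit: with one latent variable, the
    coefficients on autoscaled X are proportional to X^T y."""
    return [sum(a * b for a, b in zip(col, y)) for col in X_cols]

def ivso_round(X_cols, y, weights, n_models, rng):
    """One round: sample sub-models, normalize |coefficients| within each
    sub-model, sum them per variable, and rescale the sums to new weights."""
    p = len(X_cols)
    totals = [0.0] * p
    for mask in wbms(weights, n_models, rng):
        idx = [j for j in range(p) if mask[j]]
        if not idx:
            continue  # empty sub-model, nothing to fit
        coefs = one_lv_pls_coefs([X_cols[j] for j in idx], y)
        peak = max(abs(c) for c in coefs)
        if peak == 0:
            continue
        for j, c in zip(idx, coefs):
            totals[j] += abs(c) / peak  # normalization is an assumption
    top = max(totals)
    return [t / top for t in totals] if top > 0 else list(weights)

# Synthetic demo data (hypothetical): only variables 0 and 1 drive y.
rng = random.Random(0)
X_cols = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(6)]
y = [2 * a - b + 0.1 * rng.gauss(0, 1) for a, b in zip(X_cols[0], X_cols[1])]
X_cols = [autoscale(col) for col in X_cols]

weights = [0.5] * 6  # assumed starting weights
for _ in range(3):
    weights = ivso_round(X_cols, y, weights, n_models=100, rng=rng)
print([round(w, 3) for w in weights])
```

Because the new weights feed the next round of sampling, low-scoring variables appear in fewer and fewer sub-models rather than being dropped at once, which is the gradual elimination the abstract describes. The one-latent-variable stand-in ignores the interplay between variables that a full cross-validated PLS model would capture, so the numbers it produces are illustrative only.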
