A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling.

In this study, a new optimization algorithm for variable selection, the Variable Iterative Space Shrinkage Approach (VISSA), is proposed. It is based on the idea of model population analysis (MPA). Unlike most existing optimization methods for variable selection, VISSA statistically evaluates the performance of the variable space at each step of the optimization. Weighted binary matrix sampling (WBMS) is proposed to generate sub-models that span the variable subspace. Two rules govern the optimization procedure. First, the variable space shrinks at each step. Second, the new variable space outperforms the previous one. The second rule, which most existing methods rarely satisfy, is the core of the VISSA strategy. Compared with promising variable selection methods such as competitive adaptive reweighted sampling (CARS), Monte Carlo uninformative variable elimination (MCUVE) and iteratively retaining informative variables (IRIV), VISSA showed better prediction ability for the calibration of NIR data. In addition, VISSA is user-friendly: it needs only a few parameters, to which the results are insensitive, and the program terminates automatically without any additional stopping conditions. The MATLAB code for implementing VISSA is freely available at https://sourceforge.net/projects/multivariateanalysis/files/VISSA/.
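
The sampling step described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation, and the abstract does not give the details, so everything here is an assumption made for illustration: the function names `wbms` and `update_weights`, the parameter `keep_ratio`, the rule that column j of the binary matrix receives round(weight_j * n_models) ones at random rows, and the rule that new weights are the inclusion frequencies of each variable among the best-scoring sub-models. Scoring the sub-models (for example by a cross-validated error) is left to the caller.

```python
import random


def wbms(weights, n_models, rng):
    """Sketch of weighted binary matrix sampling (assumed scheme).

    Returns n_models binary rows, one per sub-model. Column j contains
    round(weights[j] * n_models) ones placed at random rows, so a variable
    with a larger weight appears in more sub-models.
    """
    n_vars = len(weights)
    matrix = [[0] * n_vars for _ in range(n_models)]
    for j, w in enumerate(weights):
        n_ones = round(w * n_models)
        for i in rng.sample(range(n_models), n_ones):
            matrix[i][j] = 1
    return matrix


def update_weights(matrix, scores, keep_ratio=0.05):
    """Assumed update rule: new weight = inclusion frequency of each
    variable among the best sub-models (lower score is better)."""
    n_keep = max(1, round(keep_ratio * len(matrix)))
    best = sorted(range(len(matrix)), key=lambda i: scores[i])[:n_keep]
    n_vars = len(matrix[0])
    return [sum(matrix[i][j] for i in best) / n_keep for j in range(n_vars)]


if __name__ == "__main__":
    rng = random.Random(0)
    sub_models = wbms([0.5, 0.5, 0.5, 0.5], n_models=10, rng=rng)
    # Hypothetical scores; in practice these would come from model validation.
    scores = [rng.random() for _ in sub_models]
    print(update_weights(sub_models, scores, keep_ratio=0.2))
```

Under these assumptions, a variable whose weight reaches 1 is included in every sub-model and a variable whose weight reaches 0 is never sampled again, which is one way the variable space could shrink from one step to the next.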
