Reproducibility, complementary measure of predictability for robustness improvement of multivariate calibration models via variable selections.

In multivariate calibration with spectral datasets, variable selection is often applied to identify a relevant subset of variables, which improves prediction accuracy and makes the selected fingerprint regions easier to interpret. Numerous variable selection methods have been proposed, but choosing among them is not trivial. Furthermore, in many cases the set of variables found by these methods may not be robust because of irreproducibility and uncertainty, which poses a great challenge to improving the reliability of variable selection. In this study, the reproducibility of five variable selection methods was investigated quantitatively in order to evaluate their performance. Reproducibility was quantified using Monte-Carlo sub-sampling (MCS) together with a quantitative similarity measure designed for highly collinear spectral datasets. An investigation of the reproducibility and prediction accuracy of these variable selection algorithms on two different near-infrared (NIR) datasets showed that the methods varied widely in performance, especially in their ability to identify a consistent subset of variables from the spectral datasets. Thus, a thorough assessment of reproducibility together with the predictive accuracy of the identified variables improved the statistical validity of, and confidence in, the selection outcome, aspects that conventional evaluation schemes cannot address.
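The idea of scoring reproducibility by Monte-Carlo sub-sampling can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or the five selection methods, so the sketch assumes a plain Jaccard index as the similarity measure and a hypothetical correlation-ranking selector (`select_top_k`) as the selection method. All function names and the parameters `n_runs`, `frac`, and `seed` are illustrative assumptions.

```python
import numpy as np


def select_top_k(X, y, k):
    """Hypothetical stand-in selector: keep the k variables whose absolute
    Pearson correlation with y is largest (not one of the paper's methods)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return frozenset(np.argsort(corr)[-k:].tolist())


def jaccard(a, b):
    """Jaccard index of two index sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mcs_reproducibility(X, y, selector, n_runs=50, frac=0.8, seed=0):
    """Run the selector on n_runs random sub-samples (drawn without
    replacement) and return the mean pairwise similarity of the selected
    subsets, together with the subsets themselves."""
    if n_runs < 2:
        raise ValueError("n_runs must be at least 2 to compare subsets")
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = max(2, int(round(frac * n)))
    subsets = []
    for _ in range(n_runs):
        idx = rng.choice(n, size=m, replace=False)
        subsets.append(selector(X[idx], y[idx]))
    sims = [
        jaccard(subsets[i], subsets[j])
        for i in range(n_runs)
        for j in range(i + 1, n_runs)
    ]
    return float(np.mean(sims)), subsets


# Usage on synthetic data (illustrative only, not the NIR datasets):
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 20))
y_demo = 3.0 * X_demo[:, 3] + 3.0 * X_demo[:, 7] + 0.1 * rng.normal(size=100)
score, _ = mcs_reproducibility(
    X_demo, y_demo, lambda X, y: select_top_k(X, y, 2), n_runs=20
)
```

A score near 1 means the selector returns nearly the same variables on every sub-sample, and a score near 0 means the selections barely overlap. The Jaccard index treats neighbouring, highly correlated wavelengths as unrelated, so it would likely understate reproducibility on real spectra. That limitation is presumably why the study uses a similarity measure designed for collinear data.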
