An Improved Ensemble Method for Completely Automatic Optimization of Spectral Interval Selection in Multivariate Calibration

In our recent work, Monte Carlo cross-validation stacked regression (MCCVSR) was proposed to automate the optimization of spectral interval selection in multivariate calibration. Although MCCVSR performs well under normal conditions, it needs improvement for more general applications. By the well-known principle of “garbage in, garbage out” (GIGO), MCCVSR, as a precise ensemble method, can be degraded by outlying and very poor submodels. In this paper, a statistical test is designed to exclude such ruinous submodels from the ensemble learning process, which makes the combination step more reliable. Though completely automated, the proposed method adapts to the nature of the data analyzed, including the number of training samples, the resolution of the spectra, and the quantitative potential of the submodels. The effectiveness of the submodel refinement is demonstrated on a real standard data set.
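The abstract does not state which statistical test is used, so the following is only an illustrative sketch of the screening idea, not the authors' procedure. It assumes one plausible choice: a one-sided Fisher z-test on the correlation between each submodel's cross-validated predictions and the reference values. Submodels that are not significantly better than a minimum correlation are dropped before stacking. The names `pearson_r` and `screen_submodels` and the parameters `r_min` and `alpha` are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_submodels(cv_predictions, y_ref, r_min=0.5, alpha=0.05):
    """Return indices of submodels that pass a one-sided Fisher z-test.

    cv_predictions: one list of cross-validated predictions per submodel.
    y_ref: reference values for the same samples.
    r_min, alpha: assumed tuning knobs (minimum correlation, significance level).
    """
    n = len(y_ref)
    if n < 4:
        raise ValueError("Fisher's z needs at least 4 samples")
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    kept = []
    for j, pred in enumerate(cv_predictions):
        r = pearson_r(pred, y_ref)
        # Clamp so that atanh stays finite for (near-)perfect correlations.
        r = max(min(r, 0.999999), -0.999999)
        # Fisher's z = atanh(r) is approximately normal with SD 1/sqrt(n - 3).
        z_stat = (math.atanh(r) - math.atanh(r_min)) * math.sqrt(n - 3)
        if z_stat > z_crit:
            kept.append(j)
    return kept
```

In this sketch the number of training samples enters through the `sqrt(n - 3)` factor, so the same threshold becomes stricter or looser as the calibration set changes size. Only the submodels whose indices are returned would then be passed on to the stacking step.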
