An ensemble variable selection method for vibrational spectroscopic data analysis

Wavelength selection is a critical step in pattern recognition of vibrational spectroscopic data. It not only mitigates the effect of high dimensionality on an algorithm's generalization performance but also makes multivariate classification models easier to understand and interpret. In this study, a novel partial least squares discriminant analysis (PLSDA)-based wavelength selection algorithm, termed ensemble of bootstrapping space shrinkage (EBSS), has been devised for vibrational spectroscopic data analysis. In the algorithm, a set of subsets is generated from the data set by random sampling. For each subset, a feature space is determined by maximizing the expected 10-fold cross-validation accuracy with a weighted bootstrap sampling strategy. An ensemble strategy and a sequential forward selection method are then applied to the resulting feature spaces to select characteristic variables. Experimental results on real vibrational spectroscopic data sets demonstrate that the ensemble wavelength selection algorithm can retain stable and informative variables for the final model and improve the predictive ability of multivariate classification models.
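The final step, combining the per-subset feature spaces and then running sequential forward selection, can be illustrated with a minimal sketch. The abstract does not specify the voting rule or the stopping criterion, so the choices here are assumptions: variables are ranked by how often they appear across feature spaces, and a variable is kept only if it strictly improves a caller-supplied score. The function names and the toy `score` are hypothetical; in practice the score would be something like the cross-validated PLSDA accuracy.

```python
from collections import Counter


def vote_frequencies(feature_spaces, n_variables):
    """Fraction of feature spaces that contain each variable index."""
    counts = Counter(v for space in feature_spaces for v in set(space))
    return [counts[v] / len(feature_spaces) for v in range(n_variables)]


def forward_select(ranked, score):
    """Sequential forward selection over an ordered candidate list.

    A candidate is kept only if adding it strictly improves the score
    (assumed acceptance rule, not taken from the paper).
    """
    selected, best = [], float("-inf")
    for v in ranked:
        s = score(selected + [v])
        if s > best:
            selected, best = selected + [v], s
    return selected, best


def ensemble_select(feature_spaces, n_variables, score):
    """Rank variables by vote frequency, then apply forward selection."""
    freq = vote_frequencies(feature_spaces, n_variables)
    candidates = [v for v in range(n_variables) if freq[v] > 0]
    # Most frequently selected first; ties broken by variable index.
    ranked = sorted(candidates, key=lambda v: (-freq[v], v))
    return forward_select(ranked, score)


# Toy example: four feature spaces over six variables (made-up data).
feature_spaces = [[1, 3], [1, 3, 5], [1, 2], [3, 1]]


def score(subset):
    # Hypothetical stand-in for cross-validated accuracy: rewards the
    # "informative" variables 1 and 3 and penalizes subset size.
    return len(set(subset) & {1, 3}) - 0.1 * len(subset)


selected, best = ensemble_select(feature_spaces, 6, score)
print(selected)  # [1, 3]
```

Ranking by vote frequency before forward selection means that variables chosen consistently across subsets are tried first, which is one simple way to favor stable variables over ones that appear in a single subset by chance.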
