Classification of GC‐MS measurements of wines by combining data dimension reduction and variable selection techniques

Three classification methods (Partial Least Squares Discriminant Analysis, Extended Canonical Variates Analysis and Linear Discriminant Analysis) were compared, in combination with two variable selection approaches (Forward Selection and Genetic Algorithms), for their ability to discriminate wine samples by geographical origin. Sixty‐two samples were analysed by dynamic headspace gas chromatography mass spectrometry (HS‐GC‐MS), and the entire chromatographic profile was used to build the dataset. Because variable selection techniques risk overfitting when the number of variables is large, a method coupling data dimension reduction with variable selection was proposed. This approach compresses windows of the original data by retaining only the significant components of local Principal Component Analysis models. Variable selection is then performed on these locally derived score variables. The results confirmed that the classification models built on the reduced data were better than those obtained on the entire chromatographic profile, with the exception of Extended Canonical Variates Analysis, which gave acceptable models in both cases. Copyright © 2008 John Wiley & Sons, Ltd.
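
The window compression and subsequent selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the window width, the rule for deciding which components are significant, or the classifier used inside the selection loop, so `window_width`, `min_explained` and the leave-one-out nearest-centroid scorer are all assumptions introduced here, and every function name is hypothetical.

```python
import numpy as np


def local_pca_scores(X, window_width, min_explained=0.05):
    """Compress contiguous column windows of X into local PCA scores.

    Assumption: a component counts as "significant" when it explains at
    least `min_explained` of its window's variance. The paper may use a
    different criterion.
    """
    blocks, origin = [], []
    for start in range(0, X.shape[1], window_width):
        window = X[:, start:start + window_width]
        centered = window - window.mean(axis=0)
        # SVD of the mean-centered window gives the local PCA model.
        U, s, _ = np.linalg.svd(centered, full_matrices=False)
        total = np.sum(s ** 2)
        if total == 0:
            continue  # flat window, nothing to retain
        keep = (s ** 2) / total >= min_explained
        blocks.append(U[:, keep] * s[keep])  # scores of retained components
        origin.extend([start] * int(keep.sum()))
    return np.hstack(blocks), np.array(origin)


def loo_nearest_centroid_accuracy(S, y):
    """Leave-one-out accuracy of a nearest-centroid rule (stand-in classifier)."""
    n = len(y)
    hits = 0
    for i in range(n):
        mask = np.arange(n) != i
        classes = np.unique(y[mask])
        centroids = np.array(
            [S[mask][y[mask] == c].mean(axis=0) for c in classes]
        )
        nearest = np.argmin(np.linalg.norm(centroids - S[i], axis=1))
        hits += int(classes[nearest] == y[i])
    return hits / n


def forward_selection(S, y, max_vars=5):
    """Greedily add the score variable that most improves LOO accuracy."""
    selected, best = [], 0.0
    while len(selected) < max_vars:
        candidates = [j for j in range(S.shape[1]) if j not in selected]
        if not candidates:
            break
        scores = [
            loo_nearest_centroid_accuracy(S[:, selected + [j]], y)
            for j in candidates
        ]
        top = int(np.argmax(scores))
        if scores[top] <= best:
            break  # no candidate improves the current model
        selected.append(candidates[top])
        best = scores[top]
    return selected, best
```

Calling `local_pca_scores(X, window_width=10)` on a samples-by-variables matrix returns the score matrix together with the starting column of the window each score came from, so selected variables can be traced back to a region of the chromatogram. With only a few dozen samples, the accuracy reported by the selection loop is optimistic unless the whole procedure is validated on samples held out from both the compression and the selection steps.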
