Simultaneous data pre-processing and SVM classification model selection based on a parallel genetic algorithm applied to spectroscopic data of olive oils.

Classification is an important task in chemometrics. For several years now, support vector machines (SVMs) have proven to be powerful for infrared spectral data classification. However, such methods require optimisation of parameters in order to control the risk of overfitting and the complexity of the boundary. Furthermore, it is established that the prediction ability of classification models can be improved by pre-processing the spectra to remove unwanted variance. In this paper we propose a new methodology based on a genetic algorithm (GA) for the simultaneous optimisation of SVM parameters and pre-processing (GENOPT-SVM). The method has been tested for the discrimination of the geographical origin of Italian olive oil (Ligurian and non-Ligurian) on the basis of near infrared (NIR) or mid infrared (FTIR) spectra. Different classification models (PLS-DA, SVM with mean-centred data, GENOPT-SVM) have been tested and statistically compared using McNemar's test. For both datasets, SVM with optimised pre-processing gives models with higher accuracy than those obtained with PLS-DA on pre-processed data. In the case of the NIR dataset, most of this accuracy improvement (86.3% compared with 82.8% for PLS-DA) was achieved with only a single pre-processing step. For the FTIR dataset, three optimised pre-processing steps are required to obtain an SVM model with a significant accuracy improvement (82.2%) over the one obtained with PLS-DA (78.6%). Furthermore, this study demonstrates that even SVM models have to be developed on the basis of well-corrected spectral data in order to obtain higher classification rates.
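The abstract does not say how McNemar's test was computed, so the following is only a minimal sketch of one common form of it: the chi-square version with continuity correction, applied to the predictions of two classifiers on the same test samples. The function name `mcnemar_test` and the label vectors are hypothetical illustrations, not data or code from the study.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar test for two classifiers on the same samples.

    Returns (statistic, p_value). Only the discordant samples matter:
    b = samples model A classifies correctly and model B misclassifies,
    c = samples model A misclassifies and model B classifies correctly.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        # No discordant samples: the two models are indistinguishable here.
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical labels (1 = Ligurian, 0 = non-Ligurian), for illustration only.
    y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    pred_a = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    pred_b = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
    statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
    print(f"statistic = {statistic:.3f}, p = {p_value:.3f}")
```

Because the test uses only the samples on which the two models disagree, it suits comparisons such as GENOPT-SVM versus PLS-DA evaluated on a shared test set. With few discordant samples, an exact binomial version of the test is usually preferred over the chi-square approximation sketched here.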
