A comparison of different chemometrics approaches for the robust classification of electronic nose data

Accurate detection of certain chemical vapours is important, as these may be diagnostic for the presence of weapons, drugs of misuse or disease. To achieve this, chemical sensors could be deployed remotely; however, the readout from such sensors is a multivariate pattern that must be interpreted robustly using powerful supervised learning methods. In this study, we therefore compared the classification accuracy of four pattern recognition algorithms: linear discriminant analysis (LDA), partial least squares-discriminant analysis (PLS-DA), random forests (RF) and support vector machines (SVM) employing four different kernels. For this purpose, we used electronic nose (e-nose) sensor data (Wedge et al., Sensors Actuators B Chem 143:365–372, 2009). To allow direct comparison between the four algorithms, we employed two model validation procedures based on either 10-fold cross-validation or bootstrapping. The results show that LDA (91.56% accuracy) and SVM with a polynomial kernel (91.66% accuracy) were very effective at analysing these e-nose data, giving superior prediction accuracy, sensitivity and specificity compared with the other techniques employed. For the e-nose sensor data studied here, our findings therefore favour SVM with a polynomial kernel over the other statistical models assessed. SVMs with non-linear kernels have the additional advantage that they can model both linear and non-linear mappings from the analytical data space to multi-group classifications, and would thus be a suitable algorithm for the analysis of most e-nose sensor data.
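As an illustration of this workflow, the following minimal R sketch (not the authors' original code) shows how the four classifiers could be compared under 10-fold cross-validation using the caret package [30]. The data frame enose and its class factor vapour are hypothetical placeholders for the e-nose measurements and vapour labels; caret's "pls" method applied to a factor response performs PLS-DA, and "svmPoly" fits a polynomial-kernel SVM via kernlab.

library(caret)                                   # model training and resampling front-end
set.seed(1)                                      # reproducible fold assignment

folds <- createFolds(enose$vapour, k = 10, returnTrain = TRUE)  # identical 10 folds for every model
ctrl  <- trainControl(method = "cv", index = folds)             # 10-fold CV on the shared folds

models <- list(
  lda     = train(vapour ~ ., data = enose, method = "lda",     trControl = ctrl),
  plsda   = train(vapour ~ ., data = enose, method = "pls",     trControl = ctrl),  # PLS-DA for a factor response
  rf      = train(vapour ~ ., data = enose, method = "rf",      trControl = ctrl),
  svmPoly = train(vapour ~ ., data = enose, method = "svmPoly", trControl = ctrl)   # polynomial-kernel SVM (kernlab)
)

summary(resamples(models))                       # resampled accuracy and kappa for all four classifiers

Because all four models are trained and assessed on the same folds, their accuracy estimates are directly comparable; using trainControl(method = "boot") instead would give the bootstrap-based validation mentioned above.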

[1] W. N. Venables, et al. Modern Applied Statistics with S, 2003.

[2] Douglas B. Kell, et al. Real-time vapour sensing using an OFET-based electronic nose and genetic programming, 2009.

[3] R. Brereton, et al. Partial least squares discriminant analysis: taking the magic away, 2014.

[4] Peter J. Sterk, et al. An electronic nose in the discrimination of patients with asthma and controls, 2007, The Journal of allergy and clinical immunology.

[5] David I. Ellis, et al. A comparative investigation of modern feature selection and classification approaches for the analysis of mass spectrometry data, 2014, Analytica chimica acta.

[6] Jason Weston, et al. Gene Selection for Cancer Classification using Support Vector Machines, 2002, Machine Learning.

[7] David Hinkley, et al. Bootstrap Methods: Another Look at the Jackknife, 2008.

[8] David H. Wolpert, et al. No free lunch theorems for optimization, 1997, IEEE Trans. Evol. Comput.

[9] D. B. Kell, et al. Rapid identification of urinary tract infection bacteria using hyperspectral whole-organism fingerprinting and artificial neural networks, 1998, Microbiology.

[10] Rong-En Fan, et al. Working Set Selection Using Second Order Information for Training SVM, 2005, J. Mach. Learn. Res.

[11] Douglas B. Kell, et al. Proposed minimum reporting standards for data analysis in metabolomics, 2007, Metabolomics.

[12] Zulfiqur Ali, et al. Data analysis for electronic nose systems, 2006.

[13] R. Brereton, et al. Comparison of performance of five common classifiers represented as boundary methods: Euclidean Distance to Centroids, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Learning Vector Quantization and Support Vector Machines, as dependent on data structure, 2009.

[14] P. Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods, 1989.

[15] R. Tibshirani, et al. Improvements on Cross-Validation: The 632+ Bootstrap Method, 1997.

[16] Giorgio Pennazza, et al. A preliminary study on the possibility to diagnose urinary tract cancers by an electronic nose, 2008.

[17] J. Gastwirth. The Estimation of the Lorenz Curve and Gini Index, 1972.

[18] Kurt Hornik, et al. Support Vector Machines in R, 2006.

[19] N. Ancona, et al. Support vector machines for olfactory signals recognition, 2003.

[20] Ron Kohavi, et al. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, 1995, IJCAI.

[21] R. Brereton. Consequences of sample size, variable selection, and model validation and optimisation, for predicting classification ability from analytical data, 2006.

[22] Richard G. Brereton, et al. Chemometrics for Pattern Recognition, 2009.

[23] Cullen Schaffer, et al. Selecting a classification method by cross-validation, 1993, Machine Learning.

[24] D. B. Kell, et al. Genetic programming: a novel method for the quantitative analysis of pyrolysis mass spectral data, 1997, Analytical chemistry.

[25] Chih-Jen Lin, et al. A comparison of methods for multiclass support vector machines, 2002, IEEE Trans. Neural Networks.

[26] Bernhard E. Boser, et al. A training algorithm for optimal margin classifiers, 1992, COLT '92.

[27] Bryan F. J. Manly, et al. Multivariate Statistical Methods: A Primer, 1986.

[28] Manuel A. Sánchez-Montañés, et al. Chemical Sensor Array Optimization: Geometric and Information Theoretic Approaches, 2002.

[29] R. Brereton, et al. Support vector machines for classification and regression, 2010, The Analyst.

[30] Max Kuhn, et al. Building Predictive Models in R Using the caret Package, 2008.

[31] E. Martinelli, et al. Lung cancer identification by the analysis of breath by means of an array of non-selective gas sensors, 2003, Biosensors & bioelectronics.

[32] John R. Koza, et al. Genetic programming: on the programming of computers by means of natural selection, 1993, Complex adaptive systems.

[33] Jason Weston, et al. A user's guide to support vector machines, 2010, Methods in molecular biology.

[34] B. Efron. Bootstrap Methods: Another Look at the Jackknife, 1979.

[35] M. Pardo, et al. Classification of electronic nose data with support vector machines, 2005.

[36] Andy Liaw, et al. Classification and Regression by randomForest, 2007.

[37] Nicholas Stone, et al. Investigation of support vector machines and Raman spectroscopy for lymph node diagnostics, 2010, The Analyst.

[38] Daniel Cozzolino, et al. Classification of Tempranillo wines according to geographic origin: combination of mass spectrometry based electronic nose and chemometrics, 2010, Analytica chimica acta.

[39] S. Wold, et al. PLS-regression: a basic tool of chemometrics, 2001.

[40] N. Bârsan, et al. Electronic nose: current status and future trends, 2008, Chemical reviews.

[41] Chih-Jen Lin, et al. Training and Testing Low-degree Polynomial Data Mappings via Linear SVM, 2010, J. Mach. Learn. Res.

[42] M. Pardo, et al. Random forests and nearest shrunken centroids for the classification of sensor array data, 2008.

[43] Peter C. Jurs, et al. Computational Methods for the Analysis of Chemical Sensor Array Data from Volatile Analytes, 2000.

[44] R Core Team. R: A language and environment for statistical computing, 2014.

[45] Anil K. Jain, et al. Bootstrap Techniques for Error Estimation, 1987, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46] Vladimir Vapnik, et al. An overview of statistical learning theory, 1999, IEEE Trans. Neural Networks.

[47] Landon Oakes, et al. Toward the nanospring-based artificial olfactory system for trace-detection of flammable and explosive vapors, 2012.

[48] Age K. Smilde, et al. Assessment of PLSDA cross validation, 2008.

[49] J. Brezmes, et al. Variable selection for support vector machine based multisensor systems, 2007.

[50] Sayan Mukherjee, et al. Choosing Multiple Parameters for Support Vector Machines, 2002, Machine Learning.

[51] Yun Xu, et al. Support Vector Machines: A Recent Method for Classification in Chemometrics, 2006.

[52] Yunqian Ma, et al. Practical selection of SVM parameters and noise estimation for SVM regression, 2004, Neural Networks.

[53] L. T. Tanoue. Detection of Lung Cancer by Sensor Array Analyses of Exhaled Breath, 2007.

[54] Leo Breiman, et al. Random Forests, 2001, Machine Learning.