Random forests and nearest shrunken centroids for the classification of sensor array data

Abstract Random forests and nearest shrunken centroids are among the most promising new classification methodologies. In this paper we apply them, to our knowledge for the first time, to the classification of three E-Nose datasets for food quality control applications. We compare their classification rates with those obtained by state-of-the-art support vector machines. Classifier parameters are optimized in an inner cross-validation cycle, and the error is estimated by outer cross-validation in order to avoid selection bias. Since nested cross-validation is computationally expensive, we also investigate how the error estimate depends on the number of inner and outer folds. We find that random forests and support vector machines have similar classification performance, while nearest shrunken centroids perform worse. On the other hand, random forests and nearest shrunken centroids have a built-in feature selection mechanism that is very helpful for understanding the structure of the dataset and for evaluating sensors. We show that random forests and nearest shrunken centroids produce different feature rankings, and we explain our findings through the nature of each classifier. Computations are carried out with the powerful statistical packages distributed by the R Project for Statistical Computing.
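No code accompanies the abstract; the sketch below is one way to realize the nested cross-validation scheme and the built-in feature rankings it describes, using the R packages randomForest, e1071 (support vector machines) and pamr (nearest shrunken centroids). The toy dataset, fold counts, cost grid and shrinkage threshold are illustrative assumptions, not values from the paper.

## Minimal sketch (not the authors' actual code) of nested cross-validation:
## an inner cycle tunes classifier parameters, an outer cycle estimates error.
library(randomForest)  # random forests
library(e1071)         # support vector machines
library(pamr)          # nearest shrunken centroids

set.seed(1)

## Toy stand-in for a sensor-array dataset: 60 samples x 10 features, 2 classes.
X <- matrix(rnorm(600), nrow = 60)
colnames(X) <- paste0("sensor", 1:ncol(X))
y <- factor(rep(c("good", "spoiled"), each = 30))

n.outer <- 5  # outer folds: error estimation
n.inner <- 3  # inner folds: parameter tuning
outer.fold <- sample(rep(1:n.outer, length.out = nrow(X)))
outer.err  <- numeric(n.outer)

for (k in 1:n.outer) {
  tr <- outer.fold != k
  ## Inner cycle: tune the SVM cost on the training part only, so that
  ## parameter selection cannot bias the outer error estimate.
  tuned <- tune.svm(X[tr, ], y[tr], cost = 10^(-1:2),
                    tunecontrol = tune.control(cross = n.inner))
  pred <- predict(tuned$best.model, X[!tr, ])
  outer.err[k] <- mean(pred != y[!tr])
}
cat("nested-CV error estimate:", mean(outer.err), "\n")

## Built-in feature rankings: random forests rank features by permutation
## importance; nearest shrunken centroids keep the features that survive
## the shrinkage threshold (in practice chosen by inner CV, e.g. pamr.cv).
rf <- randomForest(X, y, importance = TRUE)
print(sort(importance(rf)[, "MeanDecreaseAccuracy"], decreasing = TRUE))

nsc.data <- list(x = t(X), y = y, geneid = colnames(X))  # pamr wants features in rows
nsc <- pamr.train(nsc.data)
pamr.listgenes(nsc, nsc.data, threshold = 0)  # threshold 0 keeps all features

Keeping all parameter tuning inside the inner cycle is what prevents the bias discussed in the abstract: each outer fold is seen by a tuned model only once, as a pure test set.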
