Modern data mining tools in descriptive sensory analysis: A case study with a Random forest approach

In this paper we introduce random forest (RF) as a new modeling technique in the field of sensory analysis. As a case study, we apply RF to the predictive discrimination of six typical cheeses of the Trentino province (northern Italy) from data obtained by quantitative descriptive analysis. The sensory profiling was carried out by eight trained assessors using a specifically developed language containing 35 attributes. We compare the discrimination capabilities of RF with those of linear discriminant analysis (LDA) and discriminant partial least squares (dPLS). The RF models are more accurate, with smaller prediction errors than LDA and dPLS. RF also makes it possible to analyze the fitted models graphically, through multidimensional scaling plots based on an internal measure of similarity between samples. We compare these plots with analogous ones derived from principal component analysis and LDA, and find that the same qualitative information can be extracted from all three methods. The RF model also provides an estimate of the relative importance of each sensory attribute for the discriminant function. We couple this measure with an appropriate experimental setup to obtain an unbiased and stable method for variable selection, which compares favorably with sequential selection based on LDA models.
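The internal similarity measure mentioned in the abstract is usually taken to be the RF proximity: the fraction of trees in which two samples end up in the same terminal node. The sketch below is a minimal illustration of that definition, not the authors' implementation. The function name `rf_proximity` and the toy leaf assignments are hypothetical, and using `1 - proximity` as the dissimilarity passed to multidimensional scaling is one common choice assumed here.

```python
from itertools import combinations


def rf_proximity(leaf_ids):
    """Return the pairwise proximity matrix of a forest.

    leaf_ids[i][t] is the terminal node that sample i reaches in tree t.
    The proximity of two samples is the fraction of trees in which they
    share a terminal node.
    """
    n_samples = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    # Every sample shares all of its leaves with itself, so the diagonal is 1.
    prox = [[1.0 if i == j else 0.0 for j in range(n_samples)]
            for i in range(n_samples)]
    for i, j in combinations(range(n_samples), 2):
        shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments for 4 samples in a 4-tree forest
# (illustrative values only, not data from the study).
toy_leaves = [
    [1, 3, 2, 5],
    [1, 3, 2, 4],
    [2, 3, 7, 4],
    [6, 8, 7, 9],
]

proximity = rf_proximity(toy_leaves)

# Assumed convention: dissimilarity = 1 - proximity, which can then be
# passed to any multidimensional scaling routine to obtain a 2-D plot.
dissimilarity = [[1.0 - p for p in row] for row in proximity]
```

In practice the leaf assignments would come from a fitted forest, and the resulting dissimilarity matrix would be embedded in two dimensions to produce plots comparable to those from principal component analysis or LDA.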
