Classification of wine quality with imbalanced data

We propose a data analysis approach to classify wine into quality categories. Our analysis used a data set of 4898 white wines from the Minho region of Portugal. The data set was imbalanced, with about 93% of the observations belonging to one category, so we applied the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the minority classes. The balanced data were used to train a classifier that assigns a wine to one of three categories: high quality, normal quality, or poor quality. Three classification techniques were compared: decision tree, adaptive boosting (AdaBoost), and random forest. In our experiments, random forest gave the lowest error rate of the three.
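As a rough illustration of the oversampling step, the following is a minimal pure-Python sketch of the core SMOTE idea: a synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. It is not the implementation used in this work. The function name `smote`, its parameters, and the toy data are assumptions made for illustration, and this simplified version picks base samples at random rather than cycling through every minority sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (made up, not the wine data): 12 majority vs 3 minority samples.
majority = [[float(i), float(i % 4)] for i in range(12)]
minority = [[20.0, 1.0], [21.0, 2.0], [22.0, 1.5]]

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class. In a three-class setting such as the one described above, the same procedure would be applied to each under-represented class separately.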
