Improving emotion classification in imbalanced YouTube dataset using SMOTE algorithm

The imbalanced dataset problem degrades classification performance in several data mining applications, including pattern recognition, text categorization, and information filtering. To improve emotion classification performance, we use a sampling-based algorithm called SMOTE (Synthetic Minority Over-sampling Technique), which oversamples instances of the minority class until their number matches that of the majority class. A YouTube dataset was balanced using SMOTE and tested with three machine learning algorithms: multinomial Naïve Bayes (MNB), decision tree (DT), and support vector machines (SVM). SVM achieved the highest accuracy, with 93.30% on the filtering task and 89.44% on the classification task. These results indicate that SMOTE can address the imbalanced data problem and yield improved classification results.
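
The oversampling step described above can be illustrated with a minimal sketch. It assumes that each instance is already a numeric feature vector (for example, term weights), and it is not the implementation used in the study: the function names `smote` and `balance`, the neighbour count `k`, and the seed are hypothetical choices made for illustration. Each synthetic instance is placed at a random point on the line segment between a minority instance and one of its nearest minority-class neighbours.

```python
import random
from math import dist


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if n_synthetic > 0 and len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority samples, excluding the base itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, k=5, seed=0):
    """Oversample every smaller class up to the majority-class count."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        extra = smote(rows, target - len(rows), k=k, seed=seed)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out
```

In an evaluation setting, a function such as `balance` would normally be applied to the training split only, so that synthetic instances do not leak into the test data; the balanced set can then be passed to any classifier, such as MNB, DT, or SVM.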
