Towards Scalable Emotion Classification in Microblog Based on Noisy Training Data

The availability of labeled corpus is of great importance for emotion classification tasks. Because manual labeling is too time-consuming, hashtags have been used as naturally annotated labels to obtain large amount of labeled training data from microblog. However, the inconsistency and noise in annotation can adversely affect the data quality and thus the performance when used to train a classifier. In this paper, we propose a classification framework which allows naturally annotated data to be used as additional training data and employs a k-NN graph based data cleaning method to remove noise after noisy data has certain accumulations. Evaluation on NLP&CC2013 Chinese Weibo emotion classification dataset shows that our approach achieves 15.8 % better performance than directly using the noisy data without noise filtering. After adding the filtered data with hashtags into an existing high-quality training data, the performance increases 3.7 % compared to using the high-quality training data alone.

[1]  Carlo Strapparava,et al.  SemEval-2007 Task 14: Affective Text , 2007, Fourth International Workshop on Semantic Evaluations (SemEval-2007).

[2]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[3]  Ayhan Demiriz,et al.  Semi-Supervised Support Vector Machines , 1998, NIPS.

[4]  Saif Mohammad,et al.  Using Hashtags to Capture Fine Emotion Categories from Tweets , 2015, Comput. Intell..

[5]  Saif Mohammad,et al.  #Emotional Tweets , 2012, *SEMEVAL.

[6]  Amit P. Sheth,et al.  Harnessing Twitter "Big Data" for Automatic Emotion Identification , 2012, 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Confernece on Social Computing.

[7]  Xiaolong Wang,et al.  Cross-lingual Opinion Analysis via Negative Transfer Detection , 2014, ACL.

[8]  Zhi-Hua Zhou,et al.  CoTrade: Confident Co-Training With Data Editing , 2011, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[9]  Stewart Massie,et al.  Generating a Word-Emotion Lexicon from #Emotional Tweets , 2014, *SEMEVAL.

[10]  Xiaojun Wan,et al.  Collaborative Data Cleaning for Sentiment Classification with Noisy Training Corpus , 2011, PAKDD.

[11]  A. Mehrabian Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in Temperament , 1996 .

[12]  Changqin Quan,et al.  Construction of a Blog Emotion Corpus for Chinese Emotional Expression Analysis , 2009, EMNLP.

[13]  Robert D. Nowak,et al.  Multi-Manifold Semi-Supervised Learning , 2009, AISTATS.