A Twitter Corpus and Benchmark Resources for German Sentiment Analysis

In this paper, we present SB10k, a new corpus for sentiment analysis comprising approximately 10,000 German tweets. We use this new corpus and two existing corpora to provide state-of-the-art benchmarks for sentiment analysis in German: we implement a CNN (based on the winning system of SemEval-2016) and a feature-based SVM, and compare their performance on all three corpora. For the CNN, we also create German word embeddings trained on 300M tweets, which we then optimize for sentiment analysis using distant supervision. The new corpus, the German word embeddings (plain and optimized), and the source code to re-run the benchmarks are publicly available.
