One Million Posts: A Data Set of German Online Discussions

In this paper, we introduce a new data set consisting of user comments posted to the website of a German-language Austrian newspaper. Professional forum moderators annotated 11,773 posts according to seven categories that they considered crucial for the efficient moderation of online discussions in the context of news articles. In addition to this taxonomy and the annotated posts, the data set contains one million unlabeled posts. Experimental results with six methods establish a first baseline for predicting these categories. The data and our code are available for research purposes at https://ofai.github.io/million-post-corpus.
