Discussing the Value of Automatic Hate Speech Detection in Online Debates

This study discusses the potential value of automatically analyzing German texts to detect hate speech. In a preliminary study, we collected a dataset of user comments on news articles about the refugee crisis of 2015/16. We used a crowdsourcing approach to label a subset of the data as hateful or non-hateful, to serve as training and evaluation data. In addition, we created a vocabulary of words that indicate hate or the absence of hate. The best-performing combination of feature groups was a Word2Vec approach together with Extended 2-grams. Our study builds on previous research on English texts and demonstrates its transferability to German. The paper discusses the results with respect to their potential for media organizations and to considerations about moderation techniques and algorithmic transparency.
