A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian

Abusive speech in social media, including profanities, derogatory language, and hate speech, has reached the level of a pandemic. A system able to detect such texts could help make the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on English. This paper presents the work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated tweets, of which 1,416 (about 22%) were labelled as containing some kind of abusive speech. Those 1,416 tweets were further sub-classified, for instance into those using vulgar language, hate speech, or derogatory language. In this paper, we explain the process of data acquisition, annotation, and corpus construction. We also discuss the results of an initial analysis of the annotation quality. Finally, we present the structure of an abusive speech lexicon and its enrichment with abusive triggers extracted from the AbCoSER dataset.

2012 ACM Subject Classification: Computing methodologies → Natural language processing
