A Corpus of Turkish Offensive Language on Social Media

This paper introduces a corpus of Turkish offensive language. To our knowledge, this is the first corpus of offensive language for Turkish. The corpus consists of micro-blog posts from Twitter, and the annotation guidelines are based on a careful review of the annotation practices of recent efforts for other languages. The corpus contains 36 232 tweets sampled randomly from the Twitter stream during a period of 18 months between April 2018 and September 2019. We found that approximately 19 % of the tweets in the data contain some type of offensive language, which is further subcategorized based on the target of the offense. We describe the annotation process, discuss some interesting aspects of the data, and present results of automatically classifying the corpus using state-of-the-art text classification methods. The classifiers achieve an F1 score of 77.3 % on identifying offensive tweets, 77.9 % on determining whether a given offensive document is targeted or not, and 53.0 % on classifying the targeted offensive documents into three subcategories.
