PRHLT-UPV at SemEval-2020 Task 12: BERT for Multilingual Offensive Language Detection

This paper describes the system submitted by the PRHLT-UPV team to Task 12 of SemEval-2020: OffensEval 2020. The official title of the task is Multilingual Offensive Language Identification in Social Media, and its aim is to identify offensive language in texts. The task covers five languages: English, Arabic, Danish, Greek and Turkish. For English, we propose a model based on the BERT architecture. The approach leverages the knowledge in a pre-trained model and fine-tunes it for the task. For the other languages we use Multilingual BERT, which has been pre-trained on a large number of languages. In the experiments, the proposed method for English is compared with other approaches to analyze the relevance of the architecture used. In addition, simple models are evaluated for the other languages as a point of comparison with the proposed one. The experimental results show that the BERT-based model outperforms the other approaches. The main contribution of this work lies in this comparative study, even though the system did not reach the top positions in most of the competition rankings.
