Paper Information - PRHLT-UPV at SemEval-2020 Task 12: BERT for Multilingual Offensive Language Detection

This paper describes the system submitted by the PRHLT-UPV team for Task 12 of SemEval-2020: OffensEval 2020. The official title of the task is Multilingual Offensive Language Identification in Social Media, and its goal is to identify offensive language in texts. The languages included in the task are English, Arabic, Danish, Greek and Turkish. We propose a model based on the BERT architecture for the analysis of texts in English. The approach leverages the knowledge within a pre-trained model and fine-tunes it for this particular task. For the other languages, Multilingual BERT is used, which has been pre-trained on a large number of languages. In the experiments, the proposed method for English texts is compared with other approaches to analyze the relevance of the architecture used. Furthermore, simple models for the other languages are evaluated and compared with the proposed one. The experimental results show that the BERT-based model outperforms the other approaches. The main contribution of this work lies in this study, even though the system did not reach the top positions in most of the competition rankings.

Gretel Liz De la Peña Sarracén | Paolo Rosso
