GruPaTo at SemEval-2020 Task 12: Retraining mBERT on Social Media and Fine-tuned Offensive Language Models

We introduce an approach to multilingual Offensive Language Detection based on the mBERT transformer model. We download extra training data from Twitter in English, Danish, and Turkish, and use it to re-train the model. We then fine-tune the model on the provided training data and, in some configurations, implement a transfer learning approach exploiting the typological relatedness between English and Danish. Our systems obtained good results across the three languages (.9036 for EN, .7619 for DA, and .7789 for TR).
