MUDES: Multilingual Detection of Offensive Spans

Interest in identifying offensive content in social media has grown substantially in recent years. Previous work has dealt mostly with post-level annotations. However, identifying the offensive spans within a post is useful in many ways. To help cope with this important challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for developers, and a user-friendly web-based interface. A detailed description of MUDES’ components is presented in this paper.
