WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020. The HASOC 2020 organizers provided participants with annotated datasets of code-mixed social media posts in Dravidian languages (Malayalam-English and Tamil-English). We participated in task 1: offensive comment identification in code-mixed Malayalam YouTube comments. Our methodology takes advantage of available English data by applying cross-lingual contextual word embeddings and transfer learning to make predictions on Malayalam data. We further improve the results using various fine-tuning strategies. Our system achieved a weighted average F1 score of 0.89 on the test set and ranked 5th out of 12 participants.
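For readers unfamiliar with the reported metric, the following is a minimal sketch of how a weighted average F1 score can be computed: the F1 of each class is weighted by that class's share of the gold labels. The function name `weighted_f1` and the toy labels `NOT` and `OFF` are illustrative assumptions, not the task's official evaluation script or data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # True positives and false positives for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp  # gold items of this class that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy labels (hypothetical, not taken from the shared-task data)
gold = ["NOT", "NOT", "NOT", "OFF"]
pred = ["NOT", "NOT", "OFF", "OFF"]
score = weighted_f1(gold, pred)
print(round(score, 4))
```

Because the weights follow class support, a frequent class dominates the score, which is worth keeping in mind when the label distribution is imbalanced.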
