Abusive and Threatening Language Detection in Urdu using Boosting Based and BERT Based Models: A Comparative Approach

Online hatred is a growing concern on many social media platforms. To address it, platforms have introduced moderation policies for such content and employ moderators who review posts that violate these policies and take appropriate action. Researchers in the abusive language domain also study how to detect such content more reliably. Although there is extensive research on abusive language detection in English, there is a gap for low-resource languages such as Hindi and Urdu. In the FIRE 2021 shared task “HASOC Abusive and Threatening language detection in Urdu”, the organisers propose an abusive language detection dataset in Urdu along with one for threatening language detection. In this paper, we explore several machine learning models, including XGBoost, LightGBM (LGBM), and m-BERT-based models, for abusive and threatening content detection in Urdu on the shared task data. We observe that a Transformer model specifically trained on an abusive language dataset in Arabic gives the best performance. Our model ranked first for both abusive and threatening content detection, with F1 scores of 0.88 and 0.54, respectively. We have made our code public.
