Uncovering Political Hate Speech During Indian Election Campaign: A New Low-Resource Dataset and Baselines

Detecting hate speech in political discourse is a critical problem, and it is even harder in low-resource languages. To address this, we introduce IEHate, a new dataset of 11,457 manually annotated Hindi tweets related to the Indian Assembly Election Campaign from November 1, 2021, to March 9, 2022. We analyze the dataset in detail, focusing on the prevalence of hate speech in political communication and the different forms of hateful language used. We also benchmark the dataset using a range of machine learning, deep learning, and transformer-based algorithms. Our experiments show that these models leave room for improvement, highlighting the need for more advanced hate speech detection techniques for low-resource languages. In particular, human evaluation scores higher than the algorithms, which emphasizes the importance of combining human and automated approaches for effective hate speech moderation. IEHate can serve as a valuable resource for researchers and practitioners developing and evaluating hate speech detection techniques in low-resource languages. Overall, our work underscores the importance of identifying and mitigating hate speech in political discourse, particularly in low-resource languages. The dataset and resources for this work are available at https://github.com/Farhan-jafri/Indian-Election.
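To make the benchmarking step concrete, the following is a minimal sketch of one classical machine learning baseline: a bag-of-words multinomial Naive Bayes classifier scored with macro-averaged F1. It is not the authors' pipeline. The binary label names, the whitespace tokenizer, the placeholder tokens in the toy data, and the choice of macro F1 as the metric are all assumptions made for illustration; the actual IEHate label scheme and evaluation protocol are described in the paper and repository.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain whitespace tokenization. Real Hindi tweets would need
    # proper normalization (URLs, mentions, emoji, Unicode forms).
    return text.lower().split()


class NaiveBayesBaseline:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, text):
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(count / self.total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def macro_f1(gold, pred):
    # Unweighted mean of per-class F1, so the minority class counts equally.
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Hypothetical toy data with placeholder tokens, not drawn from IEHate.
train_texts = [
    "slur_a party_x slur_b",
    "party_x rally today",
    "slur_b party_y",
    "party_y manifesto released today",
]
train_labels = ["hate", "non-hate", "hate", "non-hate"]

model = NaiveBayesBaseline().fit(train_texts, train_labels)
test_texts = ["slur_a party_y", "rally today"]
test_labels = ["hate", "non-hate"]
predictions = [model.predict(t) for t in test_texts]
print(predictions, macro_f1(test_labels, predictions))
```

The same `macro_f1` function could be applied to human annotations held out as predictions, which is one way a human-versus-model comparison like the one mentioned above might be set up.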
