Hate Speech Detection in Roman Urdu

Hate speech is a specific type of controversial content that is widely legislated as a crime and must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted on hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, leaving millions of social media users vulnerable. In particular, to the best of our knowledge, no study has been conducted on hate speech detection in Roman Urdu text, which is widely used in the sub-continent. In this study, we scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we employed an iterative approach to develop annotation guidelines and used them to generate the Hate Speech Roman Urdu 2020 corpus. The tweets in this corpus are classified at three levels: Neutral-Hostile, Simple-Complex, and Offensive-Hate speech. As another contribution, we used five supervised learning techniques, including a deep learning technique, to evaluate and compare their effectiveness for hate speech detection. The results show that Logistic Regression outperformed all other techniques, including the deep learning technique, for two levels of classification, achieving an F1 score of 0.906 for distinguishing between Neutral-Hostile tweets and 0.756 for distinguishing between Offensive-Hate speech tweets.
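For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a binary F1 score is computed from precision and recall. The labels are made-up toy data, not drawn from the corpus, and the function name `f1_score` and the `positive` parameter are illustrative; the abstract does not state which averaging scheme (binary, macro, or weighted) the study used.

```python
def f1_score(y_true, y_pred, positive):
    """Binary F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted labels for the Neutral-Hostile level.
y_true = ["hostile", "neutral", "hostile", "hostile", "neutral", "neutral"]
y_pred = ["hostile", "neutral", "neutral", "hostile", "hostile", "neutral"]

score = f1_score(y_true, y_pred, positive="hostile")
print(round(score, 3))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score equals 2/3.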
