OSACT4 Shared Task on Offensive Language Detection: Intensive Preprocessing-Based Approach

Preprocessing is one of the key phases of the text classification pipeline. This study investigates the impact of preprocessing on text classification, specifically on offensive language and hate speech classification for Arabic text. The Arabic used on social media is informal and written in a variety of dialects, which makes text classification considerably harder. Preprocessing reduces dimensionality and removes uninformative content. We apply intensive preprocessing to the dataset before further processing and before feeding it into the classification model. This intensive preprocessing-based approach has a significant impact on the offensive language detection and hate speech detection shared tasks of the fourth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4). Our team placed third in Sub-Task A (Offensive Language Detection) and first in Sub-Task B (Hate Speech Detection), with F1 scores of 89% and 95%, respectively, achieving state-of-the-art performance in F1, accuracy, recall, and precision for Arabic hate speech detection.
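
To make the idea of intensive preprocessing concrete, the following is a minimal Python sketch of cleaning steps commonly applied to Arabic social media text. The abstract does not list the authors' exact pipeline, so every step here (URL and mention removal, diacritic and tatweel stripping, alef normalization, collapsing elongated letters, dropping non-Arabic characters) and the function name `preprocess` are illustrative assumptions, not the method used in the paper.

```python
import re

# All patterns below are assumed, illustrative steps; the source does not
# specify which normalizations the authors applied.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")  # short vowels, tanween, shadda, sukun, dagger alef
TATWEEL_RE = re.compile(r"\u0640")                    # elongation character
ALEF_RE = re.compile(r"[\u0622\u0623\u0625]")         # alef with madda / hamza above / hamza below
REPEAT_RE = re.compile(r"(.)\1{2,}")                  # same character three or more times in a row
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")     # anything outside the basic Arabic letters


def preprocess(text: str) -> str:
    """Return a normalized copy of an Arabic tweet (hypothetical pipeline)."""
    text = URL_RE.sub(" ", text)          # drop links
    text = MENTION_RE.sub(" ", text)      # drop user mentions
    text = DIACRITICS_RE.sub("", text)    # strip diacritics
    text = TATWEEL_RE.sub("", text)       # strip tatweel
    text = ALEF_RE.sub("\u0627", text)    # unify alef variants to bare alef
    text = REPEAT_RE.sub(r"\1", text)     # collapse elongated letters to one
    text = NON_ARABIC_RE.sub(" ", text)   # remove digits, Latin text, punctuation, emoji
    return " ".join(text.split())         # collapse whitespace
```

Mapping spelling variants of the same word to one form shrinks the vocabulary, which is one way preprocessing of this kind can reduce feature dimensionality before classification.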
