Exploratory Arabic Offensive Language Dataset Analysis

This paper adds insights into the resources and datasets used in Arabic offensive language research. Its main goal is to guide researchers in Arabic offensive language in selecting appropriate datasets based on their content, and in creating new Arabic offensive language resources to support and complement the available ones.

Introduction

Annotated offensive language datasets are used to automatically categorize texts according to their offensive content. Examples of offensive content include hate speech, obscene language, and vulgar language. The automated categorization process is called text classification, and it depends heavily on the availability and quality of the dataset used to build the classification model. Offensive language datasets are therefore a critical factor in the growth and success of online offensive language detection systems. Multiple attributes affect the quality of a dataset, such as its size, its annotation process, and its source. High-quality datasets provide valuable data insights and help the classification model learn effectively.

To pursue the goal of this paper, we survey several openly available Arabic offensive language datasets and provide a comprehensive overview of them through in-depth Exploratory Data Analysis (EDA). The EDA includes a statistical analysis, a textual analysis, and a contextual analysis of all datasets to investigate their content from multiple dimensions. Visualization tools are used to better understand the content and context of the data. The study ends with a summary of the results that synthesizes the main findings.

The scope of this paper covers the following research questions: What is the content of the available Arabic offensive language datasets? What are the limitations of the available Arabic offensive language datasets?
How can we complement the available Arabic offensive language datasets to contribute to text classification systems?

The paper is organized in four main sections. The first section describes the methodology in detail. The second section presents the results, and the third section builds on the second by discussing and synthesizing the results. The last section presents conclusions and design considerations.

Methodology

The survey process follows four main phases: selecting datasets, formatting datasets, analyzing datasets, and summarizing and synthesizing the results. The following paragraphs describe each phase in detail.

1) Selecting Datasets:
Three sets of criteria are defined to select the datasets: searching, formatting, and accessibility. These criteria ensure the quality of the study.

a. Defining Searching Criteria
Datasets related to offensive language, such as hate speech, vulgar language, or abusive language, are included. Only Arabic language datasets are considered, including dialectal Arabic.

b. Defining Formatting Criteria
Datasets in multiple formats are included. Most datasets are in Comma-Separated Values (CSV) format; a few are in Excel, Tab-Separated Values (TSV), or JavaScript Object Notation (JSON) format.

c. Defining Accessibility Criteria
Only datasets that have been released freely online under an open-source option are considered.

2) Formatting Datasets:
The selected datasets are in heterogeneous formats, and some of them include multiple descriptive attributes, such as publishing date, user profile, or number of annotators. We therefore process them into a minimal and consistent format.

a. Filtering Attributes
We remove all attributes that do not serve the goal of the study. Only textual messages and labels are kept.
The content of the textual messages is intentionally kept without cleaning because all content is considered for analysis purposes; however, some datasets were provided in preprocessed format only.

b. Creating CSV Files
For each dataset, we create a CSV file that stores only the textual messages and labels. This file is used for cross-label analysis and for overall dataset analysis.

c. Creating Textual Files
For each label within each dataset, we create a text file that contains only the textual content. This file is used for textual and contextual analysis.

3) Analyzing Datasets:
This is the most important phase of the study. The analysis phase adds value and insight about the content of the datasets. We present detailed investigations of the content of each dataset by conducting statistical, textual, and contextual analysis, and by generating multiple graphs to visualize the content.

a. Statistical analysis
The statistical analysis covers word frequencies, stop word frequencies, statistical measurements of text length based on the number of tokens, and statistical measurements of token length based on the number of characters, in order to analyze their relationships with offensive content.

To extract the most frequently used words for each class accurately, we remove stop words from the text. The stop word list combines the NLTK Arabic stop word list and the stop word list of Albadi, Kurdi, and Mishra (2018). Then, we search for words that begin with the prefix 'ال' and remove the prefix. We do not remove 'ال' when it is part of the word itself rather than a prefix, such as in the word “الله”.

A simple count of token frequencies is useful for comparing multiple classes; however, it does not provide rich information about each class separately. We use the web-based tool Voyant to further analyze the text and identify the top five most distinctive words of each class.
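The word frequency step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the stop word set is a tiny hypothetical sample (the study combines the NLTK Arabic list with the list of Albadi, Kurdi, and Mishra, 2018), the prefix is assumed to be the definite article "ال", the exception list contains only the one word the text names, and the minimum-length guard is an assumption added to avoid stripping very short tokens.

```python
from collections import Counter

# Hypothetical sample of stop words, for illustration only.
STOP_WORDS = {"من", "في", "على", "إلى", "عن"}

# Assumed prefix: the Arabic definite article.
ARTICLE = "ال"

# Words where the leading letters belong to the word itself.
# Hypothetical exception list; the text names only this example.
KEEP_AS_IS = {"الله"}


def strip_definite_article(token: str) -> str:
    """Drop a leading article unless the token is a listed exception."""
    if token in KEEP_AS_IS:
        return token
    # Assumed guard: leave tokens of three characters or fewer untouched.
    if token.startswith(ARTICLE) and len(token) > 3:
        return token[len(ARTICLE):]
    return token


def top_words(posts, n=10):
    """Count whitespace-separated tokens after stop word and prefix removal."""
    counts = Counter()
    for post in posts:
        for token in post.split():
            if token in STOP_WORDS:
                continue  # stop words are removed first, as in the text
            counts[strip_definite_article(token)] += 1
    return counts.most_common(n)
```

Calling `top_words` once per class file gives per-class frequency lists that can then be compared across classes.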
Stop words can help define the context of the posts. We conduct a simple frequency analysis to generate the top stop words per class, because a stop word that appears only in a particular class might be better treated as a regular word than as a stop word in the analysis.

We investigate the complexity of the text used in each class to check whether there is any pattern or relationship between text complexity and the type of offensive content. We use two measures for this analysis: the number of characters per token and the number of tokens per post.

1 https://voyant-tools.org/

b. Textual analysis
Before applying any cleaning or filtering to the data, we generate word cloud graphs for each label of each dataset using the textual files, to give some intuition about the raw content of each class. The data in all datasets are extracted from user-generated content platforms; such content is usually unstructured and written in dialectal Arabic, which is not supported by most of the available textual analysis tools. Thus, we were unable to perform part-of-speech (POS) tagging to analyze words by their functional roles and to investigate whether those roles influence the offensive content.

c. Contextual analysis
We study the impact of context on offensive content. Context is defined in terms of text sentiment, the use of emojis, and the use of punctuation. To better understand the context of the samples, we use the Mazajak online tool for Arabic sentiment analysis to predict the sentiment of each sample. Each sample is thus classified as positive, negative, or neutral depending on its content. Emojis are often used in online communication to reflect emotion and express personality, so considering them adds value to understanding the text. Punctuation provides clues to the meaning of unfamiliar phrases and to the context of the sentence. We therefore analyze the use of punctuation and its effect on offensive content.
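The two complexity measures and the punctuation count described in the analysis phase can be sketched as below. This is an illustrative sketch under stated assumptions, not the study's code: tokens are taken to be whitespace-separated strings (so attached punctuation counts toward token length), punctuation is identified through Unicode general categories beginning with "P", and the function names are hypothetical.

```python
import statistics
import unicodedata
from collections import Counter


def length_stats(posts):
    """Mean tokens per post and mean characters per token.

    Assumes whitespace tokenization and at least one non-empty post.
    """
    tokens_per_post = [len(post.split()) for post in posts]
    chars_per_token = [len(tok) for post in posts for tok in post.split()]
    return {
        "mean_tokens_per_post": statistics.mean(tokens_per_post),
        "mean_chars_per_token": statistics.mean(chars_per_token),
    }


def punctuation_counts(posts):
    """Count every character whose Unicode category is punctuation (P*)."""
    counts = Counter()
    for post in posts:
        for ch in post:
            if unicodedata.category(ch).startswith("P"):
                counts[ch] += 1
    return counts
```

Running both functions on each class file separately makes it possible to compare the measures across classes. The Unicode category test also covers Arabic marks such as the Arabic comma and question mark, which ASCII-only punctuation lists would miss.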
2 mazajak.inf.ed.ac.uk

4) Summarizing and Synthesizing Results:
After reviewing the analysis results, we connect results across the datasets and summarize the overall findings. We add further insight by synthesizing the results with findings from previous studies, and we provide design considerations for other researchers in the same domain.

Datasets Analysis Results

This section presents the results of the dataset analysis in chronological order, based on the publication date of each dataset. A total of nine datasets satisfy the selection criteria: Aljazeera.net Deleted Comments (Mubarak, Darwish, and Magdy, 2017), Egyptian Tweets (Mubarak, Darwish, and Magdy, 2017), YouTube Comments (Alakrot, Murray, and Nikolov, 2018), Religious Hate Speech (Albadi, Kurdi, and Mishra, 2018), Levantine Hate Speech and Abusive Language (Mulki et al., 2019), Tunisian Hate Speech and Abusive Language (Haddad, Mulki, and Oueslati, 2019), Multi-Platform Offensive Language Dataset (Chowdhury et al., 2020), the Fourth Workshop on Open-Source Arabic Corpora and Corpora Processing Tools (Mubarak et al., 2020), and the Multi-Platform Hate Speech Dataset (Omar, Mahmoud, and Abd ElHafeez, 2020).

1) The Aljazeera.net Deleted Comments Dataset:
The Aljazeera.net deleted comments dataset was developed by Mubarak, Darwish, and Magdy (2017). It includes a total of 31,692 comments, labeled with three classes: 5,653 clean comments, 533 obscene comments, and 25,506 offensive comments. Figure 1 shows the class distribution.

Figure 1 Class distribution for the Aljazeera dataset

The dataset contains 8 duplicate comments: 2 clean comments, 1 obscene comment, and 5 offensive comments. The following is an example from the duplicated offensive comments:
مھب لعف امك مھریھطت بجی نیناریلااو ضفورلا نم نیدلا راجت .. مكلاعن تحت ضرلأا سوبأو مكیدایأ ىلع دشأ مكیدانأ مكیدانأ للاتحلاا نم ھیبرعلا يضارلاا ررحتت فوس اھدعب .
نیدلا حلاص يلیئارسلااو يسرافلا Translation: I am calling you, I am calling you and hold your hands and kiss the land beneath your shoes.. The land need to be cleaned from the Iranian and Shia as Salah Al-Deen did before, after that the Arabic land will get free from the Persian and Israeli
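The class counts reported for the Aljazeera.net dataset can be checked and turned into percentage shares with a few lines of code, and duplicate comments can be found by counting repeated texts. The sketch below is illustrative only: the counts are the ones stated in this section, while the function names and the `(text, label)` row layout are assumptions, and duplicates are taken to mean exact string matches.

```python
from collections import Counter

# Class counts as reported for the Aljazeera.net deleted comments dataset.
ALJAZEERA_COUNTS = {"clean": 5653, "obscene": 533, "offensive": 25506}


def class_shares(counts):
    """Return each class's share of the dataset as a percentage."""
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


def duplicate_texts(rows):
    """Return texts that occur more than once.

    rows is assumed to be an iterable of (text, label) pairs; only exact
    string matches are treated as duplicates.
    """
    seen = Counter(text for text, _label in rows)
    return [text for text, count in seen.items() if count > 1]
```

The three counts sum to the stated total of 31,692 comments, and the offensive class accounts for roughly 80% of them, which shows how strongly the dataset is skewed toward offensive content.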