A Saudi Dialect Twitter Corpus for Sentiment and Emotion Analysis

In this paper, we introduce the Saudi Dialects Twitter Corpus (SDTC), which comprises 5,400 tweets written in Saudi dialects and Modern Standard Arabic and annotated for both sentiment analysis and emotion analysis. Three raters labeled each tweet according to its polarity (positive, negative, neutral, objective, spam, or not sure) and the emotion it carries, using Ekman's basic emotions (anger, fear, disgust, sadness, happiness, and surprise) together with the labels no emotion and not sure. The data show comparable kappa and Fleiss' kappa values for polarity and emotion classification. The average pairwise agreement between raters was 65%, the average pairwise kappa was 0.55, and Fleiss' kappa for the three raters was 0.55. These kappa and Fleiss' kappa values indicate moderate agreement. Together with the mapping between polarity and emotion labels in the SDTC, they confirm the consistency and regularity of the classification process. To the best of our knowledge, the SDTC is the first Twitter corpus of Saudi dialect that is labeled by three raters for both the polarity of the tweets and the emotions they carry.
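
The agreement statistics reported above can be illustrated with a minimal sketch. It assumes that the pairwise kappa is Cohen's kappa and that "moderate" refers to the Landis and Koch scale (0.41 to 0.60); the abstract states neither explicitly. The function names and the toy labels are hypothetical and are not taken from the SDTC.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(items):
    """Fleiss' kappa; `items` holds one sequence of labels per item,
    with the same number of raters for every item."""
    n_items = len(items)
    n_raters = len(items[0])
    totals = Counter()
    p_bar = 0.0
    for labels in items:
        counts = Counter(labels)
        totals.update(counts)
        # Share of agreeing rater pairs on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        p_bar += agreeing / (n_raters * (n_raters - 1))
    p_bar /= n_items
    # Chance agreement from the pooled label proportions.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)


def landis_koch(kappa):
    """Verbal label for a kappa value on the Landis and Koch (1977) scale."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"),
             (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Toy polarity labels for six tweets (made up for illustration only).
    raters = {
        "r1": ["pos", "neg", "neu", "pos", "neg", "spam"],
        "r2": ["pos", "neg", "pos", "pos", "neu", "spam"],
        "r3": ["pos", "neu", "neu", "pos", "neg", "spam"],
    }
    pairwise = [cohen_kappa(raters[x], raters[y])
                for x, y in combinations(raters, 2)]
    mean_kappa = sum(pairwise) / len(pairwise)
    # Transpose to one tuple of three labels per tweet.
    fleiss = fleiss_kappa(list(zip(*raters.values())))
    print("mean pairwise kappa:", round(mean_kappa, 3), landis_koch(mean_kappa))
    print("Fleiss' kappa:", round(fleiss, 3), landis_koch(fleiss))
```

Under these assumptions, the reported value of 0.55 falls in the 0.41 to 0.60 band, which matches the "moderate agreement" reading given in the abstract. Both functions are undefined when every label falls in a single category, because chance agreement then equals 1.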
