A Characterization Study of Arabic Twitter Data with a Benchmarking for State-of-the-Art Opinion Mining Models

Opinion mining in Arabic is a challenging task given the rich morphology of the language. The task becomes more challenging when applied to Twitter data, which contains additional sources of noise: unstandardized dialectal variations, nonconformity with grammatical rules, Arabizi and code-switching, and non-text objects such as images and URLs used to express opinion. In this paper, we perform an analytical study to observe how such linguistic phenomena vary across different Arab regions. This characterization of Arabic Twitter aims to provide a better understanding of Arabic tweets and to foster advanced research on the topic. Furthermore, we explore the performance of the two schools of machine learning on Arabic Twitter, namely the feature engineering approach and the deep learning approach. We consider models that have achieved state-of-the-art performance for opinion mining in English. Results highlight the advantages of using deep learning-based models and confirm the importance of using morphological abstractions to address Arabic's complex morphology.
