Towards POS Tagging for Arabic Tweets

Part-of-Speech (POS) tagging is a key step in many NLP algorithms. However, tweets are difficult to POS tag because there are many phenomena that frequently appear in Twitter that are not as common, or are entirely absent, in other domains: tweets are short, are not always written maintaining formal grammar and proper spelling, and abbreviations are often used to overcome their restricted lengths. Arabic tweets also show a further range of linguistic phenomena such as usage of different dialects, romanised Arabic and borrowing foreign words. In this paper, we present an evaluation and a detailed error analysis of stateof-the-art POS taggers for Arabic when applied to Arabic tweets. The accuracy of standard Arabic taggers is typically excellent (96-97%) on Modern Standard Arabic (MSA) text ; however,their accuracy declines to 49-65% on Arabic tweets. Further, we present our initial approach to improve the taggers’ performance. By making improvements based on observed errors, we are able to reach 74% tagging accuracy.

[1]  Kareem Darwish,et al.  Subjectivity and Sentiment Analysis of Modern Standard Arabic and Arabic Microblogs , 2013, WASSA@NAACL-HLT.

[2]  Muhammad Abdul-Mageed,et al.  SAMAR: A System for Subjectivity and Sentiment Analysis of Arabic Social Media , 2012, WASSA@ACL.

[3]  Roxana Girju,et al.  A supervised POS tagger for written Arabic social networking corpora , 2012, KONVENS.

[4]  Walter Daelemans,et al.  Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers , 2000, LREC.

[5]  Tanveer A. Faruquie,et al.  Adapting a WSJ trained part-of-speech tagger to noisy text: preliminary results , 2011, MOCR_AND '11.

[6]  David Yarowsky,et al.  Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day , 2002, CoNLL.

[7]  Nawal A. El-Fishawy,et al.  Arabic summarization in Twitter social network , 2014 .

[8]  Brendan T. O'Connor,et al.  Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments , 2010, ACL.

[9]  Kalina Bontcheva,et al.  Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data , 2013, RANLP.

[10]  Mona T. Diab,et al.  Second Generation AMIRA Tools for Arabic Processing : Fast and Robust Tokenization , POS tagging , and Base Phrase Chunking , 2009 .

[11]  Nizar Habash,et al.  MADA + TOKAN : A Toolkit for Arabic Tokenization , Diacritization , Morphological Disambiguation , POS Tagging , Stemming and Lemmatization , 2009 .

[12]  James R. Curran,et al.  Bootstrapping POS-taggers using unlabelled data , 2003, CoNLL.

[13]  Verena Rieser,et al.  An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis , 2014, LREC.

[14]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[15]  Oren Etzioni,et al.  Named Entity Recognition in Tweets: An Experimental Study , 2011, EMNLP.

[16]  Timothy Baldwin,et al.  Lexical Normalisation of Short Text Messages: Makn Sens a #twitter , 2011, ACL.