Fast and Robust POS tagger for Arabic Tweets Using Agreement-based Bootstrapping

Part-of-speech (POS) tagging is a key step in many NLP algorithms. However, tweets are difficult to POS tag: they are short, they do not always follow formal grammar or standard spelling, and they often use abbreviations to fit within the length limit. Arabic tweets show a further range of linguistic phenomena, such as the use of different dialects, romanised Arabic, and borrowed foreign words. In this paper, we present an evaluation and a detailed error analysis of state-of-the-art POS taggers for Arabic when applied to Arabic tweets. On the basis of this analysis, we combine normalisation and external knowledge to handle the noisiness of the domain, and we exploit bootstrapping to construct extra training data in order to improve POS tagging for Arabic tweets. Our results show significant improvements over the performance of a number of well-known taggers for Arabic.
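The abstract does not spell out the bootstrapping procedure. One common reading of agreement-based bootstrapping is to run several taggers over unlabelled text and keep only the sentences on which they all produce identical tag sequences, treating those as extra training data. The following is a minimal sketch under that assumption; the function `agreed_sentences`, the two toy taggers, and the English placeholder tokens are hypothetical illustrations, not the taggers or data used in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# A tagger maps a tokenised sentence to one tag per token (assumed interface).
Tagger = Callable[[Sequence[str]], List[str]]
TaggedSentence = List[Tuple[str, str]]


def agreed_sentences(
    sentences: Sequence[Sequence[str]], taggers: Sequence[Tagger]
) -> List[TaggedSentence]:
    """Keep only sentences on which every tagger yields the same tag sequence."""
    if not taggers:
        raise ValueError("at least one tagger is required")
    selected: List[TaggedSentence] = []
    for tokens in sentences:
        outputs = [tagger(tokens) for tagger in taggers]
        first = outputs[0]
        # Full agreement on every token is the selection criterion here.
        if all(out == first for out in outputs[1:]):
            selected.append(list(zip(tokens, first)))
    return selected


# Two hypothetical toy taggers standing in for real, heterogeneous taggers.
def lexicon_tagger(tokens: Sequence[str]) -> List[str]:
    lexicon = {"the": "DET", "cat": "NOUN", "sleeps": "VERB"}
    return [lexicon.get(t, "NOUN") for t in tokens]


def suffix_tagger(tokens: Sequence[str]) -> List[str]:
    tags = []
    for t in tokens:
        if t == "the":
            tags.append("DET")
        elif t.endswith("s"):
            tags.append("VERB")
        else:
            tags.append("NOUN")
    return tags


unlabelled = [["the", "cat", "sleeps"], ["the", "cats"]]
extra_training_data = agreed_sentences(unlabelled, [lexicon_tagger, suffix_tagger])
```

In this toy run the taggers agree on the first sentence and disagree on "cats" in the second, so only the first sentence would be added to the training set. A real pipeline would then retrain a tagger on the original data plus the agreed sentences, possibly over several rounds.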
