Paper Information - POS Tagging for Arabic Tweets

POS Tagging for Arabic Tweets

Part-of-Speech (POS) tagging is a key step in many NLP algorithms. However, tweets are difficult to POS tag because many phenomena that appear frequently on Twitter are less common, or entirely absent, in other domains: tweets are short, do not always follow formal grammar and proper spelling, and often use abbreviations to work within their restricted length. Arabic tweets also show a further range of linguistic phenomena, such as the use of different dialects, romanised Arabic and borrowed foreign words. In this paper, we present an evaluation and a detailed error analysis of state-of-the-art POS taggers for Arabic when applied to Arabic tweets. The accuracy of standard Arabic taggers is typically excellent (96-97%) on Modern Standard Arabic (MSA) text; however, their accuracy declines to 49-65% on Arabic tweets. Further, we present our initial approach to improving the taggers' performance. By making improvements based on the observed errors, we are able to reach 79% tagging accuracy.
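The accuracy figures quoted above are most naturally read as token-level accuracy: the share of tokens whose predicted tag matches the gold-standard tag. The abstract does not spell out the exact metric, so the following is only a minimal sketch under that assumption. The function name `tagging_accuracy` and the tag labels in the example are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag.

    Assumes both sequences are aligned token by token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical tags for a five-token tweet (illustrative labels only).
gold_tags = ["NN", "VBD", "IN", "NNP", "HASH"]
predicted_tags = ["NN", "VBD", "IN", "NN", "NN"]

accuracy = tagging_accuracy(gold_tags, predicted_tags)
print(f"Token-level accuracy: {accuracy:.0%}")
```

In this toy example three of the five tags match, so the accuracy is 60%. A real evaluation would aggregate over every token in the annotated tweet corpus.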

Allan Ramsay | Fahad Albogamy
