Comparative analysis of ML POS on Arabic tweets

Social media text such as tweets poses a challenge for natural language processing. Unlike highly edited genres written in standard language, for which traditional NLP tools were developed, conversational text contains many non-standard lexical items and syntactic patterns. These result from dialectal variation, topic diversity, orthography, unintended errors, conversational errors, and creative language use. The idiosyncratic style, noise, and linguistic errors of Twitter text make it difficult to part-of-speech (POS) tag. The aim of this paper is to design and implement POS tagging models for Arabic tweets by investigating several machine learning models, including K-Nearest Neighbour, Naïve Bayes, and decision tree models. The paper introduces a novel Arabic Twitter corpus and assesses various state-of-the-art POS taggers that are retrained on this corpus. A state-of-the-art accuracy of 87.97% is achieved when tagging Twitter text.
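To make the per-token classification setup concrete, here is a minimal sketch of one of the model families named above, a Naïve Bayes tagger with add-one smoothing, written in plain Python. Everything specific in it is an assumption made for illustration and is not taken from the paper: the feature set in `token_features`, the tag names, the toy tweets in `TRAIN`, and the `NaiveBayesTagger` class itself. The paper's own features, tag set, corpus, and implementation may differ, and the K-Nearest Neighbour and decision tree models are not shown.

```python
import math
from collections import Counter, defaultdict


def token_features(tokens, i):
    """Hypothetical features for token i (not the paper's feature set)."""
    word = tokens[i]
    return [
        "word=" + word,
        "suffix2=" + word[-2:],
        "prefix1=" + word[:1],  # picks up '#' and '@' markers
        "prev=" + (tokens[i - 1] if i > 0 else "<s>"),
    ]


class NaiveBayesTagger:
    """Multinomial Naive Bayes over token features, add-one smoothing."""

    def __init__(self):
        self.tag_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # tag -> feature -> count
        self.vocab = set()

    def fit(self, tagged_tweets):
        for tweet in tagged_tweets:
            tokens = [word for word, _ in tweet]
            for i, (_, tag) in enumerate(tweet):
                self.tag_counts[tag] += 1
                for feat in token_features(tokens, i):
                    self.feat_counts[tag][feat] += 1
                    self.vocab.add(feat)
        return self

    def predict_token(self, tokens, i):
        total = sum(self.tag_counts.values())
        best_tag, best_score = None, -math.inf
        for tag, n in self.tag_counts.items():
            # log prior plus smoothed log likelihood of each feature
            score = math.log(n / total)
            denom = sum(self.feat_counts[tag].values()) + len(self.vocab)
            for feat in token_features(tokens, i):
                score += math.log((self.feat_counts[tag][feat] + 1) / denom)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag

    def tag(self, tokens):
        return [self.predict_token(tokens, i) for i in range(len(tokens))]


def accuracy(tagger, tagged_tweets):
    """Token-level accuracy: fraction of tokens given the gold tag."""
    correct = total = 0
    for tweet in tagged_tweets:
        tokens = [word for word, _ in tweet]
        for pred, (_, gold) in zip(tagger.tag(tokens), tweet):
            correct += int(pred == gold)
            total += 1
    return correct / total


# Toy, made-up training data; the tags are illustrative only.
TRAIN = [
    [("@user", "MENTION"), ("كتب", "VERB"), ("الولد", "NOUN"), ("#خبر", "HASHTAG")],
    [("قرأ", "VERB"), ("الطالب", "NOUN"), ("الكتاب", "NOUN")],
    [("@friend", "MENTION"), ("ذهب", "VERB"), ("الرجل", "NOUN"), ("#عاجل", "HASHTAG")],
]

tagger = NaiveBayesTagger().fit(TRAIN)
```

In a real comparison, each model would be trained on the same annotated split and scored on held-out tweets with the same token-level `accuracy` measure, so that the reported percentages are directly comparable.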
