POS-tagging of Tunisian Dialect Using Standard Arabic Resources and Tools

Developing natural language processing tools usually requires a large number of resources (lexica, annotated corpora, etc.), which often do not exist for less-resourced languages. One way to overcome this lack of resources is to devote substantial effort to building new ones from scratch. Another approach is to exploit existing resources of closely related languages. In this paper, we focus on developing a part-of-speech tagger for the Tunisian Arabic dialect (TUN), a low-resource language, by exploiting its closeness to Modern Standard Arabic (MSA), which has many state-of-the-art resources and tools. Our system achieved an accuracy of 89% (a ∼20% absolute improvement over an MSA tagger baseline).
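To make the reported figures concrete, the following minimal sketch shows how token-level tagging accuracy and an absolute improvement over a baseline are typically computed. It is an illustration only, not the evaluation code used in this work: the function names, the tag labels, and the toy sequences are all hypothetical, and the toy numbers are unrelated to the reported 89%.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def absolute_improvement(system_acc: float, baseline_acc: float) -> float:
    """Difference in accuracy (not a ratio), i.e. percentage points / 100."""
    return system_acc - baseline_acc


# Hypothetical toy data: five tokens with made-up tag labels.
gold = ["NOUN", "VERB", "PREP", "NOUN", "ADJ"]
baseline_pred = ["NOUN", "NOUN", "PREP", "VERB", "ADJ"]  # 3 of 5 correct
system_pred = ["NOUN", "VERB", "PREP", "NOUN", "NOUN"]   # 4 of 5 correct

baseline_acc = tagging_accuracy(gold, baseline_pred)
system_acc = tagging_accuracy(gold, system_pred)
gain = absolute_improvement(system_acc, baseline_acc)
print(f"baseline={baseline_acc:.2f} system={system_acc:.2f} gain={gain:.2f}")
```

Read this way, an absolute improvement of about 20% on top of an 89% final accuracy implies an MSA tagger baseline of roughly 69% on the same data; this figure is inferred from the two reported numbers and is not stated explicitly above.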
