Building a Corpus for Palestinian Arabic: a Preliminary Study

This paper presents preliminary results in building an annotated corpus of the Palestinian Arabic dialect. The corpus consists of about 43K words drawn from diverse sources. The paper discusses some linguistic facts about the Palestinian dialect in comparison with Modern Standard Arabic, especially in terms of morphological, orthographic, and lexical variation, and suggests some directions for resolving the challenges these differences pose for annotation. Furthermore, we present two pilot studies that investigate whether existing tools for processing Modern Standard Arabic and Egyptian Arabic can be used to speed up the annotation of our Palestinian Arabic corpus.
