Morphological disambiguation of Tunisian dialect

In this paper, we propose a method for disambiguating the output of a morphological analyzer for the Tunisian dialect. We test three machine-learning techniques that classify each morphological analysis of a word token as either true or false, with the label assigned according to the context of the word in its sentence. When these classifiers fail to single out one analysis, we combine their results with a bigram classifier so that exactly one analysis is chosen for each word. We apply the method to the output of Al-Khalil-TUN, a morphological analyzer for the Tunisian dialect (Zribi et al., 2013b), and use the Spoken Tunisian Arabic Corpus, STAC (Zribi et al., 2015), for training and testing. In our evaluation, the proposed method achieves an accuracy of 87.32%.
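
The selection scheme described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: each analysis is reduced to a single tag string, the binary classifier is a hypothetical stand-in function `is_correct`, and the fallback rule (pick the candidate with the highest bigram count given the previously chosen tag) is an assumption about how a bigram classifier could break ties. The names `train_bigrams` and `disambiguate` are likewise illustrative.

```python
from collections import Counter

def train_bigrams(tagged_sentences):
    """Count (previous_tag, tag) pairs over sentences given as lists of tags."""
    counts = Counter()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker (assumed convention)
        for tag in sentence:
            counts[(prev, tag)] += 1
            prev = tag
    return counts

def disambiguate(sentence_candidates, is_correct, bigrams):
    """Choose exactly one analysis per word.

    sentence_candidates: one list of candidate analyses per word.
    is_correct: hypothetical binary classifier, (prev_tag, analysis) -> bool.
    bigrams: Counter of (prev_tag, tag) pairs used only as a fallback.
    """
    chosen = []
    prev = "<s>"
    for candidates in sentence_candidates:
        accepted = [a for a in candidates if is_correct(prev, a)]
        if len(accepted) == 1:
            # The binary classifier singled out one analysis.
            best = accepted[0]
        else:
            # Failure case (none or several accepted): fall back to bigram counts.
            pool = accepted or candidates
            best = max(pool, key=lambda a: bigrams[(prev, a)])
        chosen.append(best)
        prev = best
    return chosen

# Toy data, invented for illustration only.
bigrams = train_bigrams([["DET", "NOUN", "VERB"], ["PRON", "VERB", "NOUN"]])
toy_classifier = lambda prev, tag: not (prev == "DET" and tag == "VERB")
result = disambiguate([["DET"], ["NOUN", "VERB"], ["NOUN", "VERB"]],
                      toy_classifier, bigrams)
print(result)
```

In this toy run the second word is resolved by the classifier alone, while the third word has two accepted candidates and is resolved by the bigram fallback.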

[1]  William W. Cohen.  Fast Effective Rule Induction , 1995, ICML.

[2]  Nizar Habash,et al.  Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development , 2014, LREC.

[3]  Vladimir Cherkassky,et al.  The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[4]  Nizar Habash,et al.  Conventional Orthography for Dialectal Arabic , 2012, LREC.

[5]  Daniel Jurafsky,et al.  Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks , 2004, NAACL.

[6]  Nizar Habash,et al.  MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic , 2014, LREC.

[7]  Nizar Habash,et al.  Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking , 2008, ACL.

[8]  Gérald Purnelle,et al.  Normalizing speech transcriptions for Natural Language Processing , 2009 .

[9]  W. N. H. W. Mohamed,et al.  A comparative study of Reduced Error Pruning method in decision tree algorithms , 2012, 2012 IEEE International Conference on Control System, Computing and Engineering.

[10]  Lamia Hadrich Belguith,et al.  Orthographic Transcription for Spoken Tunisian Arabic , 2013, CICLing.

[11]  Eric Brill,et al.  Some Advances in Transformation-Based Part of Speech Tagging , 1994, AAAI.

[12]  Abdessatar Mahfoudhi,et al.  A Minimalist Account of Word Order and Agreement Variation in Arabic , 2002 .

[13]  Nizar Habash,et al.  A Conventional Orthography for Algerian Arabic , 2015, ANLP@ACL.

[14]  Nizar Habash,et al.  A Large Scale Corpus of Gulf Arabic , 2016, LREC.

[15]  Ahmed Hamdi.  Traitement automatique du dialecte tunisien à l'aide d'outils et de ressources de l'arabe standard : application à l'étiquetage morphosyntaxique , 2015 .

[16]  Nizar Habash,et al.  Introduction to Arabic Natural Language Processing , 2010 .

[17]  S. Khoja,et al.  APT: Arabic Part-of-speech Tagger , 2001 .

[18]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[19]  Nizar Habash,et al.  POS-tagging of Tunisian Dialect Using Standard Arabic Resources and Tools , 2015, ANLP@ACL.

[20]  Nizar Habash,et al.  Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop , 2005, ACL.

[21]  Nizar Habash,et al.  MAGEAD: A Morphological Analyzer and Generator for the Arabic Dialects , 2006, ACL.

[22]  Mohand Tilmatine,et al.  Substrat et convergences: le berbère et l'arabe nord-africain , 1999 .

[23]  Wolfgang Maier,et al.  An Arabic-Moroccan Darija Code-Switched Corpus , 2016, LREC.

[24]  Frédéric Béchet,et al.  De l'arabe standard vers l'arabe dialectal : projection de corpus et ressources linguistiques en vue du traitement automatique de l'oral dans les médias tunisiens , 2014, Trait. Autom. des Langues.

[25]  Nizar Habash,et al.  50th Annual Meeting of the Association for Computational Linguistics Proceedings of the Conference Volume 2: Short Papers , 2012 .

[26]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[27]  Lamia Hadrich Belguith,et al.  Morphological Analysis of Tunisian Dialect , 2013, IJCNLP.

[28]  Nizar Habash,et al.  Morphological Analysis and Disambiguation for Dialectal Arabic , 2013, NAACL.

[29]  Nizar Habash,et al.  On Arabic Transliteration , 2007 .

[30]  Kareem Darwish,et al.  Arabizi Detection and Conversion to Arabic , 2013, ANLP@EMNLP.

[31]  Nizar Habash,et al.  Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation , 2011, EMNLP 2011.

[32]  Philippe Blache,et al.  Spoken Tunisian Arabic Corpus "STAC": Transcription and Annotation , 2015, Res. Comput. Sci..

[33]  Roxana Girju,et al.  A supervised POS tagger for written Arabic social networking corpora , 2012, KONVENS.

[34]  Yamina Tlili-Guiassa.  Hybrid Method for Tagging Arabic Text , 2006 .

[35]  Philippe Blache,et al.  Sentence Boundary Detection for Transcribed Tunisian Arabic , 2016, KONVENS.

[36]  Nizar Habash,et al.  Building a Corpus for Palestinian Arabic: a Preliminary Study , 2014, ANLP@EMNLP.

[37]  Maciej Piasecki,et al.  Multiclassifier Approach to Tagging of Polish , 2006 .

[38]  Nizar Habash,et al.  Parsing Arabic Dialects , 2006, EACL.

[39]  Nizar Habash,et al.  A Conventional Orthography for Tunisian Arabic , 2014, LREC.

[40]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[41]  Nizar Habash,et al.  Automatic Transliteration of Romanized Dialectal Arabic , 2014, CoNLL.

[42]  Nizar Habash,et al.  Morphologically Annotated Corpora and Morphological Analyzers for Moroccan and Sanaani Yemeni Arabic , 2016, LREC.

[43]  Ahmad T. Al-Taani,et al.  A rule-based approach for tagging non-vocalized Arabic words , 2009, Int. Arab J. Inf. Technol..

[44]  Nizar Habash,et al.  Morphological Analysis and Generation for Arabic Dialects , 2005, SEMITIC@ACL.

[45]  Nizar Habash,et al.  ADAM: Analyzer for Dialectal Arabic Morphology , 2014, J. King Saud Univ. Comput. Inf. Sci..

[46]  Kevin Duh,et al.  POS Tagging of Dialectal Arabic: A Minimally Supervised Approach , 2005, SEMITIC@ACL.