Segmentation for Domain Adaptation in Arabic

Segmentation is an integral component of many NLP applications, including machine translation, parsing, and information retrieval. When a model trained on the standard language is applied to dialects, accuracy drops dramatically. However, the standard language and the dialects share more lexical items than mere surface word matching can reveal: this shared lexicon is obscured by extensive cliticization, gemination, and character repetition. In this paper, we demonstrate that segmentation and base normalization of dialects aid domain adaptation by reducing data sparseness. Segmentation improves system performance by reducing the number of out-of-vocabulary (OOV) tokens, isolating the differences, and allowing better exploitation of the commonalities. We show that adding a small amount of dialectal segmentation training data reduces OOVs by 5% and markedly improves POS tagging for dialects, by 7.37% F-score, even though no dialect-specific POS training data is included.
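To make the OOV argument concrete, the following is a minimal toy sketch (not the paper's system): a rule-based clitic stripper with an invented, highly simplified clitic inventory in Buckwalter-style transliteration. It shows how splitting conjunction, article, and pronominal clitics off surface words can expose bases shared between training and test data, lowering the OOV rate.

```python
# Toy illustration only: a greedy clitic segmenter with a hypothetical,
# drastically simplified clitic list (real Arabic segmenters use
# morphological analysis, not string stripping).

PREFIXES = ["w+", "Al+"]      # e.g. conjunction w+, definite article Al+
SUFFIXES = ["+h"]             # e.g. 3rd person masculine pronominal clitic

def segment(token):
    """Strip at most one known prefix and one known suffix clitic."""
    parts = []
    for p in PREFIXES:
        stem = p[:-1]                         # drop the '+' marker
        if token.startswith(stem) and len(token) > len(stem) + 1:
            parts.append(p)
            token = token[len(stem):]
            break
    suffix = None
    for s in SUFFIXES:
        tail = s[1:]
        if token.endswith(tail) and len(token) > len(tail) + 1:
            suffix = s
            token = token[:-len(tail)]
            break
    parts.append(token)
    if suffix:
        parts.append(suffix)
    return parts

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens absent from the training vocabulary."""
    vocab = set(train_tokens)
    return sum(1 for t in test_tokens if t not in vocab) / len(test_tokens)

# Invented toy data: bases ktAb 'book', mktb 'office', jdyd 'new'.
train = ["ktAb", "wmktb", "jdyd"]             # training surface forms
test = ["wktAb", "Almktb", "jdydh"]           # test forms with clitics

raw = oov_rate(train, test)                   # every test token is unseen
seg_train = [m for t in train for m in segment(t)]
seg_test = [m for t in test for m in segment(t)]
seg = oov_rate(seg_train, seg_test)           # shared bases now match
print(raw, seg)                               # segmented OOV rate is lower
```

On this toy data, every unsegmented test token is OOV, whereas after segmentation the bases `ktAb`, `mktb`, and `jdyd` match the training side and only unseen clitics remain OOV; this is the sparseness-reduction effect, in miniature, that the paper exploits for domain adaptation.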
