Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Data sparsity is one of the main challenges posed by code-switching (CS), and it is further exacerbated in morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide a detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of segmentation setup for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency-based and morphology-based segmentation performs best. In better-resourced settings, such a combination does not bring significant improvements over frequency-based segmentation alone.
