Can You Traducir This? Machine Translation for Code-Switched Input

Code-switching (CSW) is common in multilingual geographic or social contexts and raises challenging problems for natural language processing tools. We focus here on machine translation (MT) of CSW texts, where we aim to simultaneously disentangle and translate the two mixed languages. Because actual translated CSW data is lacking, we generate artificial training data from regular parallel texts. Experiments show that this training strategy yields MT systems that surpass multilingual systems on code-switched texts. These results are confirmed in an alternative task aimed at providing contextual translations for an L2 writing assistant.
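To make the idea of deriving artificial CSW data from regular parallel text more concrete, here is a minimal sketch. It is not the procedure used in this work, which the abstract does not detail. It assumes that a one-to-one word alignment between the two sides of a sentence pair is already available, and it swaps randomly chosen source tokens for their aligned target tokens. The function name `make_code_switched`, the `switch_prob` parameter, and the toy sentence pair are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def make_code_switched(
    src: List[str],
    tgt: List[str],
    alignment: List[Tuple[int, int]],
    switch_prob: float = 0.3,
    seed: int = 0,
) -> List[str]:
    """Build a synthetic mixed-language sentence from one parallel pair.

    Assumption: `alignment` holds (source_index, target_index) pairs and is
    one-to-one; a real aligner may produce many-to-many links that would need
    span-level handling.
    """
    rng = random.Random(seed)
    # Map each aligned source position to its target position.
    src_to_tgt: Dict[int, int] = {i: j for i, j in alignment}
    mixed = []
    for i, token in enumerate(src):
        # Switch languages at this position with probability `switch_prob`.
        if i in src_to_tgt and rng.random() < switch_prob:
            mixed.append(tgt[src_to_tgt[i]])
        else:
            mixed.append(token)
    return mixed


# Toy English-Spanish pair with a hand-written alignment (illustrative only).
src = "I read books".split()
tgt = "yo leo libros".split()
alignment = [(0, 0), (1, 1), (2, 2)]

mixed = make_code_switched(src, tgt, alignment, switch_prob=0.5, seed=1)
# One possible training example: mixed-language input, monolingual reference.
training_pair = (" ".join(mixed), " ".join(tgt))
```

Pairing the mixed sentence with either monolingual side gives a training example in which the system must both identify the embedded language and translate it, which is the behaviour the abstract describes.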
