Machine Translation on a Parallel Code-Switched Corpus

Code-switching (CS) occurs when a speaker alternates between two or more languages within an utterance or discourse. In this work, we investigate the existence of code-switching in formal text, namely the proceedings of multilingual institutions. Our study focuses on Arabic-English code-mixing in a parallel corpus extracted from official documents of the United Nations. We build a parallel code-switched corpus with two reference translations, one in pure Arabic and the other in pure English. We also carry out a human evaluation of this resource, with the aim of using it to evaluate the translation of code-switched documents. To the best of our knowledge, no corpus of this kind currently exists. This paper examines several methods for translating a code-switched corpus: conventional statistical machine translation, end-to-end neural machine translation, and multitask learning.
