Recognition and translation of code-switching speech utterances

Code-switching (CS), a hallmark of bilingual communities worldwide, refers to the practice of bilinguals (or multilinguals) mixing two or more languages within a discourse, often with little change of interlocutor or topic. The units and locations of the switches vary widely, from single-word switches to whole phrases (beyond the length of loanword units). Such phenomena pose challenges for spoken language technologies such as automatic speech recognition (ASR), since the systems must be able to handle input in a multilingual setting. Several studies have constructed CS ASR systems for many different language pairs, but their common aim has been merely to transcribe CS speech utterances into CS text sentences for a single individual. In contrast, this study addresses the situational context of dialogs between CS and non-CS (monolingual) speakers and supports monolingual speakers who want to understand CS speakers. We construct a system that recognizes code-switching speech and translates it into monolingual text. We investigate several approaches, including a cascade of ASR and neural machine translation (NMT), a cascade of ASR and a deep bidirectional language model (BERT), an ASR system that directly outputs monolingual transcriptions from CS speech, and multi-task learning. Finally, we evaluate and discuss these four approaches on a Japanese-English CS to English monolingual task.
