Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS

Code-switching (CS) speech, in which speakers alternate between two or more languages within a single utterance, is common in multilingual communities. The phenomenon poses challenges for spoken language technologies such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), because these systems must handle input in a multilingual setting. Code-switching text or code-switching speech can be found in social media, but parallel speech and transcriptions of code-switching data, which are needed to train ASR and TTS, are generally unavailable. In this paper, we utilize a deep-learning-based speech chain framework to enable ASR and TTS to learn code-switching in a semi-supervised fashion. We base our system on Japanese-English conversational speech. We first train the ASR and TTS systems separately on parallel speech-text monolingual data (supervised learning), and then run the speech chain with only code-switching text or only code-switching speech (unsupervised learning). Experimental results reveal that such a closed-loop architecture allows ASR and TTS to learn from each other and improves performance even without any parallel code-switching data.
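The closed-loop idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `toy_tts`/`toy_asr` stubs and the loss functions are hypothetical stand-ins for the attention-based ASR and Tacotron-style TTS models, and the gradient updates driven by these losses are omitted.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def char_errors(hyp, ref):
    """Simple character-mismatch count (a stand-in for the real ASR loss)."""
    return sum(h != r for h, r in zip(hyp, ref)) + abs(len(hyp) - len(ref))

def speech_chain_losses(asr, tts, batch):
    """Reconstruction losses for one unsupervised speech-chain step.

    Text-only data:   text  -> TTS -> ASR -> text reconstruction loss.
    Speech-only data: speech -> ASR -> TTS -> speech reconstruction loss.
    In training, these losses would backpropagate into ASR and TTS.
    """
    losses = {}
    if "text" in batch:
        reconstructed_text = asr(tts(batch["text"]))
        losses["text"] = char_errors(reconstructed_text, batch["text"])
    if "speech" in batch:
        reconstructed_speech = tts(asr(batch["speech"]))
        losses["speech"] = mse(reconstructed_speech, batch["speech"])
    return losses

# Toy "models" that round-trip perfectly (placeholders for neural networks):
toy_tts = lambda text: [float(ord(c)) for c in text]
toy_asr = lambda feats: "".join(chr(int(v)) for v in feats)

print(speech_chain_losses(toy_asr, toy_tts, {"text": "mix speech"}))
```

With the perfect toy round trip both losses are zero; with real ASR and TTS models, minimizing these reconstruction losses on unpaired code-switching text and speech is what lets the two systems teach each other.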
