Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS

Code-switching (CS) speech, in which speakers alternate between two or more languages within a single utterance, is common in multilingual communities. The phenomenon poses challenges for spoken language technologies such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), because these systems must handle input in a multilingual setting. Code-switching text or code-switching speech can be found in social media, but parallel speech and transcriptions of code-switching data, which are needed to train ASR and TTS, are generally unavailable. In this paper, we utilize a deep-learning-based speech chain framework to enable ASR and TTS to learn code-switching in a semi-supervised fashion. We base our system on Japanese-English conversational speech. We first train the ASR and TTS systems separately on parallel speech-text monolingual data (supervised learning), and then run the speech chain with only code-switching text or only code-switching speech (unsupervised learning). Experimental results reveal that such a closed-loop architecture allows ASR and TTS to learn from each other and improves performance even without any parallel code-switching data.
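The closed-loop idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `toy_tts`/`toy_asr` stubs and the loss functions are hypothetical stand-ins for the attention-based ASR and Tacotron-style TTS models, and the gradient updates driven by these losses are omitted.

```python
def mse(a, b):
    """Mean squared error between two equal-length feature sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def char_errors(hyp, ref):
    """Simple character-mismatch count (a stand-in for the real ASR loss)."""
    return sum(h != r for h, r in zip(hyp, ref)) + abs(len(hyp) - len(ref))

def speech_chain_losses(asr, tts, batch):
    """Reconstruction losses for one unsupervised speech-chain step.

    Text-only data:   text  -> TTS -> ASR -> text reconstruction loss.
    Speech-only data: speech -> ASR -> TTS -> speech reconstruction loss.
    In training, these losses would backpropagate into ASR and TTS.
    """
    losses = {}
    if "text" in batch:
        reconstructed_text = asr(tts(batch["text"]))
        losses["text"] = char_errors(reconstructed_text, batch["text"])
    if "speech" in batch:
        reconstructed_speech = tts(asr(batch["speech"]))
        losses["speech"] = mse(reconstructed_speech, batch["speech"])
    return losses

# Toy "models" that round-trip perfectly (placeholders for neural networks):
toy_tts = lambda text: [float(ord(c)) for c in text]
toy_asr = lambda feats: "".join(chr(int(v)) for v in feats)

print(speech_chain_losses(toy_asr, toy_tts, {"text": "mix speech"}))
```

With the perfect toy round trip both losses are zero; with real ASR and TTS models, minimizing these reconstruction losses on unpaired code-switching text and speech is what lets the two systems teach each other.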
