Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain

The phenomenon in which a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because both must cope with multilingual input. Although CS text and speech can be found on social media, datasets that pair CS speech with corresponding CS transcriptions, which supervised training requires, are hard to obtain. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other through semisupervised learning. After supervised learning with monolingual data, the machine speech chain carries out unsupervised learning on either CS text or CS speech alone: each model generates pseudo-pairs for the other, so ASR and TTS train each other in a closed feedback loop. The results show that the machine speech chain trains ASR and TTS together and improves their performance without requiring paired CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain so that explicit language information helps the models handle CS. We demonstrate that our proposed approach improves performance on both a single CS language pair and multiple CS language pairs, including an unknown CS pair excluded from the training data.

Key words: ASR, code-switching, language identification, semisupervised learning, TTS, machine speech chain
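The dual unsupervised step of the speech chain can be sketched as follows. This is a minimal toy PyTorch sketch, not the paper's actual system: the Linear/Embedding stand-ins (ToyASR, ToyTTS) and the chain_step_* functions are illustrative assumptions in place of the attention-based sequence-to-sequence ASR and TTS models the paper uses. It shows only the two unsupervised directions, where each model's output is treated as fixed pseudo-data for the other model's update.

```python
# Toy sketch of the two unsupervised machine speech chain updates, assuming
# stand-in models; real systems use attention-based seq2seq ASR and TTS.
import torch
import torch.nn as nn

FEAT_DIM, VOCAB = 80, 32            # mel-spectrogram dim, token vocabulary size

class ToyASR(nn.Module):            # speech frames -> per-frame token logits
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, VOCAB)
    def forward(self, speech):      # (T, FEAT_DIM) -> (T, VOCAB)
        return self.net(speech)

class ToyTTS(nn.Module):            # token ids -> speech frames
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, FEAT_DIM)
        self.net = nn.Linear(FEAT_DIM, FEAT_DIM)
    def forward(self, tokens):      # (T,) -> (T, FEAT_DIM)
        return self.net(self.emb(tokens))

asr, tts = ToyASR(), ToyTTS()
opt_asr = torch.optim.Adam(asr.parameters(), lr=1e-3)
opt_tts = torch.optim.Adam(tts.parameters(), lr=1e-3)
ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

def chain_step_text_only(cs_tokens):
    """Unpaired CS text: TTS synthesizes speech, ASR learns to transcribe it."""
    with torch.no_grad():                      # synthesized speech is fixed data
        fake_speech = tts(cs_tokens)
    loss = ce(asr(fake_speech), cs_tokens)     # ASR loss against the known text
    opt_asr.zero_grad(); loss.backward(); opt_asr.step()
    return loss.item()

def chain_step_speech_only(cs_speech):
    """Unpaired CS speech: ASR transcribes it, TTS learns to re-synthesize."""
    with torch.no_grad():                      # recognized text is fixed data
        pseudo_tokens = asr(cs_speech).argmax(dim=-1)
    loss = l1(tts(pseudo_tokens), cs_speech)   # TTS loss against the real speech
    opt_tts.zero_grad(); loss.backward(); opt_tts.step()
    return loss.item()

# One toy update per direction with random placeholder "data".
print(chain_step_text_only(torch.randint(VOCAB, (50,))))
print(chain_step_speech_only(torch.randn(50, FEAT_DIM)))
```

In the full approach, a language embedding (or a language identification output) would additionally condition both models so they know which language each segment belongs to; that conditioning is omitted here for brevity.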
