Data Augmentation for End-to-End Code-Switching Speech Recognition

Training a code-switching end-to-end automatic speech recognition (ASR) model normally requires a large amount of data, while code-switching data is often limited. In this paper, three novel approaches are proposed for code-switching data augmentation: audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by either word translation or word insertion. Our experiments on a 200-hour Mandarin-English code-switching dataset show that each of the three proposed approaches yields significant improvements on code-switching ASR on its own. Moreover, all the proposed approaches can be combined with the recently popular SpecAugment, and an additional gain is obtained. The WER is reduced by a relative 24.0% compared to the system without any data augmentation, and by a relative 13.0% compared to the system with only SpecAugment.
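The abstract does not spell out how the new code-switching texts are generated, but the two text-based strategies it names (word translation and word insertion) can be illustrated with a minimal sketch. The lexicon, the probability values, and the helper names below are all hypothetical placeholders, not the paper's actual implementation; in the paper's pipeline the generated sentences would subsequently be fed to a TTS system to produce additional training audio.

```python
import random

# Toy Mandarin-to-English lexicon used purely for illustration
# (hypothetical data; a real bilingual dictionary would be used in practice).
ZH_EN_LEXICON = {
    "会议": "meeting",
    "项目": "project",
    "报告": "report",
}


def translate_augment(tokens, lexicon=ZH_EN_LEXICON, p=0.3):
    """Word-translation sketch: with probability p, replace a Mandarin
    token that appears in the lexicon with its English translation,
    turning a monolingual sentence into a code-switching one."""
    return [
        lexicon[tok] if tok in lexicon and random.random() < p else tok
        for tok in tokens
    ]


def insertion_augment(tokens, en_words=("okay", "anyway"), p=0.2):
    """Word-insertion sketch: with probability p after each Mandarin
    token, insert a common English word to create code-switching text."""
    out = []
    for tok in tokens:
        out.append(tok)
        if random.random() < p:
            out.append(random.choice(en_words))
    return out


if __name__ == "__main__":
    sentence = ["我们", "明天", "开", "会议", "讨论", "项目"]
    print(" ".join(translate_augment(sentence)))
    print(" ".join(insertion_augment(sentence)))
```

The third approach, audio splicing, operates directly on the existing code-switching recordings rather than on text, so it is not covered by this text-only sketch.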
