Data Augmentation for End-to-End Code-Switching Speech Recognition

Training a code-switching end-to-end automatic speech recognition (ASR) model normally requires a large amount of data, while code-switching data is often limited. In this paper, three novel approaches are proposed for code-switching data augmentation: audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by either word translation or word insertion. Our experiments on a 200-hour Mandarin-English code-switching dataset show that each of the three proposed approaches yields significant improvements on code-switching ASR on its own. Moreover, all the proposed approaches can be combined with the recently popular SpecAugment, and an additional gain is obtained. The WER is reduced by a relative 24.0% compared to the system without any data augmentation, and by a relative 13.0% compared to the system with only SpecAugment.
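The abstract does not spell out how the new code-switching texts are generated, but the two text-based strategies it names (word translation and word insertion) can be illustrated with a minimal sketch. The lexicon, the probability values, and the helper names below are all hypothetical placeholders, not the paper's actual implementation; in the paper's pipeline the generated sentences would subsequently be fed to a TTS system to produce additional training audio.

```python
import random

# Toy Mandarin-to-English lexicon used purely for illustration
# (hypothetical data; a real bilingual dictionary would be used in practice).
ZH_EN_LEXICON = {
    "会议": "meeting",
    "项目": "project",
    "报告": "report",
}


def translate_augment(tokens, lexicon=ZH_EN_LEXICON, p=0.3):
    """Word-translation sketch: with probability p, replace a Mandarin
    token that appears in the lexicon with its English translation,
    turning a monolingual sentence into a code-switching one."""
    return [
        lexicon[tok] if tok in lexicon and random.random() < p else tok
        for tok in tokens
    ]


def insertion_augment(tokens, en_words=("okay", "anyway"), p=0.2):
    """Word-insertion sketch: with probability p after each Mandarin
    token, insert a common English word to create code-switching text."""
    out = []
    for tok in tokens:
        out.append(tok)
        if random.random() < p:
            out.append(random.choice(en_words))
    return out


if __name__ == "__main__":
    sentence = ["我们", "明天", "开", "会议", "讨论", "项目"]
    print(" ".join(translate_augment(sentence)))
    print(" ".join(insertion_augment(sentence)))
```

The third approach, audio splicing, operates directly on the existing code-switching recordings rather than on text, so it is not covered by this text-only sketch.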
