Addressing Accent Mismatch In Mandarin-English Code-Switching Speech Recognition

Automatic speech recognition systems suffer from accuracy degradation when they encounter code-switching, i.e., multiple languages spoken within a single utterance. Code-switching is especially common among non-native speakers, whose accented speech is mismatched with the acoustic model. In this paper, we experiment on Mandarin-English code-switching audio spoken by native Chinese speakers and evaluate three techniques for improving accuracy: data adaptation, individual senone modeling, and lexicon enrichment. Our results show that recognition of accented speech improves by up to 12% on various code-switching datasets. We also propose several metrics for measuring code-switching recognition quality that are not captured by the standard word error rate (WER).
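To illustrate what a code-switching metric beyond overall WER might look like, the sketch below computes separate Mandarin and English error rates over a Levenshtein alignment, so that errors on embedded English words are not averaged away by the dominant Mandarin token count. This is not the paper's metric; the tokenization, the language-attribution rule, and helper names such as `per_language_error_rates` are illustrative assumptions.

```python
# Hypothetical sketch of a language-segregated error metric for
# Mandarin-English code-switching ASR (illustrative only; not the
# metric proposed in the paper).

def is_mandarin(token: str) -> bool:
    """Treat a token as Mandarin if it contains any CJK character."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in token)

def align(ref, hyp):
    """Levenshtein alignment; returns a list of (op, ref_tok, hyp_tok)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            ops.append(('match' if ref[i - 1] == hyp[j - 1] else 'sub',
                        ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(('del', ref[i - 1], None))
            i -= 1
        else:
            ops.append(('ins', None, hyp[j - 1]))
            j -= 1
    return list(reversed(ops))

def per_language_error_rates(ref, hyp):
    """Error rate per language over reference tokens of that language;
    insertions are attributed to the language of the inserted hypothesis token."""
    errors = {'zh': 0, 'en': 0}
    counts = {'zh': 0, 'en': 0}
    for op, r, h in align(ref, hyp):
        if r is not None:
            lang = 'zh' if is_mandarin(r) else 'en'
            counts[lang] += 1
            if op in ('sub', 'del'):
                errors[lang] += 1
        else:  # insertion
            lang = 'zh' if is_mandarin(h) else 'en'
            errors[lang] += 1
    return {lang: (errors[lang] / counts[lang] if counts[lang] else 0.0)
            for lang in counts}

if __name__ == '__main__':
    ref = '我 想 要 一杯 latte 谢谢'.split()
    hyp = '我 想 要 一杯 了 太 谢谢'.split()
    # The English loanword is misrecognized; the English error rate reflects
    # this even though overall WER remains low.
    print(per_language_error_rates(ref, hyp))
```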
