论文信息 - CTC regularized model adaptation for improving LSTM RNN based multi-accent Mandarin speech recognition

CTC regularized model adaptation for improving LSTM RNN based multi-accent Mandarin speech recognition

This paper proposes a novel regularized adaptation method for long short term memory (LSTM) recurrent neural network (RNN) based acoustic model trained with connectionist temporal classification (CTC) loss function (LSTM-RNN-CTC) to improve the performance of multi-accent Mandarin speech recognition task. In general, directly adjusting the network parameters with a small adaptation set may lead to over-fitting. In order to avoid this problem, we add a regularization term to the original training criterion. It forces the conditional probability distribution over initial and final (I/F) sequences estimated from the adapted model to be close to the accent independent (AI) model. Meanwhile, hidden layers of LSTM RNN should not be adjusted, but only the accent-specific output layer needs to be fine-tuned using this adaptation method. Experiments on RASC863 and CASIA regional accent speech corpus show that the proposed method obtains obvious improvement when compared with LSTM-RNN-CTC baseline model.

Bin Liu | Hao Ni | Jianhua Tao | Jiangyan Yi | Zhengqi Wen

[1] Yanpeng Li,et al. Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer , 2015, INTERSPEECH.

[2] Souvik Kundu,et al. Speaker-aware training of LSTM-RNNS for acoustic modelling , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[4] Dimitra Vergyri,et al. Automatic speech recognition of multiple accented English data , 2010, INTERSPEECH.

[5] Yajie Miao,et al. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[6] Yi Su,et al. Accent detection and speech recognition for Shanghai-accented Mandarin , 2005, INTERSPEECH.

[7] Geoffrey E. Hinton,et al. Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8] Andrew W. Senior,et al. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition , 2014, ArXiv.

[9] Yifan Gong,et al. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Tao Chen,et al. Accent Issues in Large Vocabulary Continuous Speech Recognition , 2004, Int. J. Speech Technol..

[11] Kaisheng Yao,et al. KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12] Yongqiang Wang,et al. Investigations on speaker adaptation of LSTM RNN models for speech recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Li-Rong Dai,et al. Speaker adaptation OF RNN-BLSTM for speech recognition based on speaker code , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14] Pascale Fung,et al. Multi-accent Chinese speech recognition , 2006, INTERSPEECH.

[15] J. Hansen,et al. A STUDY OF TEMPORAL FEATURES AND FREQUENCY CHARACTERISTICS IN AMERICAN ENGLISH FOREIGN ACCENT , 1997 .

[16] Tanja Schultz,et al. Comparison of acoustic model adaptation techniques on non-native speech , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[17] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[18] Yifan Gong,et al. Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation , 2014, INTERSPEECH.

[19] Yi Liu,et al. Reliable Accent-Specific Unit Generation With Discriminative Dynamic Gaussian Mixture Selection for Multi-Accent Chinese Speech Recognition , 2013, IEEE Transactions on Audio, Speech, and Language Processing.