CTC regularized model adaptation for improving LSTM RNN based multi-accent Mandarin speech recognition

This paper proposes a novel regularized adaptation method for long short-term memory (LSTM) recurrent neural network (RNN) acoustic models trained with the connectionist temporal classification (CTC) loss function (LSTM-RNN-CTC), with the aim of improving multi-accent Mandarin speech recognition. Directly adjusting the network parameters on a small adaptation set may lead to over-fitting. To avoid this problem, we add a regularization term to the original training criterion. This term forces the conditional probability distribution over initial and final (I/F) sequences estimated by the adapted model to stay close to the one estimated by the accent-independent (AI) model. In addition, the hidden layers of the LSTM RNN are left unchanged; only the accent-specific output layer is fine-tuned with this adaptation method. Experiments on the RASC863 and CASIA regional accent speech corpora show that the proposed method yields a clear improvement over the LSTM-RNN-CTC baseline model.
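
One plausible general form of such a regularized criterion is sketched below. This is an illustration only: the abstract does not give the exact formulation, so the interpolation weight \rho, the divergence D (for example, a KL divergence), and the symbol \theta_{\text{out}} for the accent-specific output-layer parameters are all assumptions introduced here. \mathbf{x} stands for the acoustic feature sequence and \mathbf{z} for an I/F label sequence.

```latex
% Assumed form: interpolate the CTC loss on the adaptation data with a
% penalty that keeps the adapted model close to the accent-independent model.
% Only the output-layer parameters \theta_{out} are updated; hidden layers stay fixed.
\hat{\mathcal{L}}(\theta_{\text{out}})
  = (1-\rho)\,\mathcal{L}_{\text{CTC}}(\theta_{\text{out}})
  + \rho\, D\!\left( P_{\text{AI}}(\mathbf{z}\mid\mathbf{x}) \,\Big\|\, P_{\theta_{\text{out}}}(\mathbf{z}\mid\mathbf{x}) \right),
\qquad 0 \le \rho \le 1
```

Under this assumed form, \rho = 0 reduces to ordinary fine-tuning of the output layer with the CTC loss, while \rho = 1 keeps the adapted distribution at the AI model's; intermediate values trade fit to the small adaptation set against drift from the AI model.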
