Online Speaker Adaptation for LVCSR Based on Attention Mechanism

Speaker adaptation is one of the most popular and important topics in speech recognition. In this paper, we propose a novel online speaker adaptation technique for deep neural network based large vocabulary continuous speech recognition (LVCSR). In this approach, the i-vectors of the speakers in the training set are extracted and stored as a static memory. For each frame, an attention mechanism is used to select from the memory the speaker i-vectors most relevant to the current speech segment. We also propose a new attention mechanism to improve performance. The vectors obtained by the attention mechanism provide speaker information that improves speech recognition accuracy. Experiments on the Switchboard task show that the proposed approach achieves a relative 8.3% word error rate (WER) reduction over the speaker-independent model without any adaptation data. The result is comparable to that of the popular i-vector based offline speaker adaptation method and is much better than that of the i-vector based online speaker adaptation method.
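
The sketch below illustrates the general idea of attending over a static i-vector memory to produce a per-frame speaker embedding. It is a minimal illustration only: it uses plain dot-product attention with a hypothetical query projection `W_q`, not the new attention mechanism proposed in the paper, and all dimensions and inputs are placeholder assumptions.

```python
# Minimal sketch (not the paper's exact formulation) of selecting speaker
# information from a static i-vector memory with dot-product attention.
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attend_ivector_memory(frame_feat, memory, W_q):
    """Return a soft speaker embedding for one acoustic frame.

    frame_feat : (d_a,)    acoustic feature vector for the current frame/segment
    memory     : (N, d_i)  i-vectors of the N training speakers (static memory)
    W_q        : (d_i, d_a) hypothetical projection mapping the frame into i-vector space
    """
    query = W_q @ frame_feat          # project the frame into i-vector space
    scores = memory @ query           # relevance of each stored speaker i-vector
    weights = softmax(scores)         # attention weights over training speakers
    return weights @ memory           # weighted combination = speaker embedding

# Toy usage with random values (shapes only; no real i-vectors involved).
rng = np.random.default_rng(0)
d_a, d_i, n_spk = 40, 100, 500
frame = rng.standard_normal(d_a)
memory = rng.standard_normal((n_spk, d_i))
W_q = rng.standard_normal((d_i, d_a)) * 0.01
spk_vec = attend_ivector_memory(frame, memory, W_q)
print(spk_vec.shape)  # (100,) -> appended to the acoustic features as speaker information
```

Because the memory holds only training-set i-vectors, no enrollment or adaptation data from the test speaker is needed at decoding time, which is what makes the approach online.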
