Speaker adaptation using speaker-normalized DNN based on speaker codes

Recently, deep neural networks (DNNs) have become one of the mainstream approaches to acoustic modeling for automatic speech recognition. Speaker adaptation techniques have also been explored for DNN-based speech recognition, including one based on a framework of bias adaptation using speaker codes. This paper introduces speaker-normalized training into this framework and experimentally demonstrates its effectiveness. In the conventional speaker-code method, two kinds of networks, the speaker-independent (SI) DNN and the subnetworks for speaker adaptation, are trained sequentially. We expect that by training the SI network and the subnetworks simultaneously, the method can be tuned to handle both SI information and speaker-dependent (SD) information more adequately. Furthermore, unlike the conventional method, the speaker-code vector is generated by a network from a 1-of-N speaker representation, which reduces the training cost of the SI model and the subnetworks and helps avoid over-fitting. Experimental evaluations on the TIMIT database demonstrate that the proposed training method reduces the phoneme error rate by 5.7% relative.
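To make the speaker-code mechanism concrete, the following is a minimal NumPy sketch of the forward pass described above: a 1-of-N speaker vector is mapped through a (hypothetical) code-generation layer to a continuous speaker code, and an adaptation subnetwork turns that code into a speaker-dependent bias added to an SI hidden layer. All dimensions, weight initializations, and function names are illustrative assumptions, not values from the paper, and training (joint optimization of SI weights and subnetworks) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper)
n_speakers, code_dim, feat_dim, hidden_dim = 4, 8, 40, 32

# Code-generation layer: maps a 1-of-N speaker vector to a speaker code
W_code = rng.normal(size=(n_speakers, code_dim)) * 0.1

# SI hidden layer and adaptation subnetwork (speaker code -> hidden bias)
W_si = rng.normal(size=(feat_dim, hidden_dim)) * 0.1
W_adapt = rng.normal(size=(code_dim, hidden_dim)) * 0.1

def hidden(x, speaker_id):
    """One speaker-adapted hidden layer: SI affine transform plus an
    SD bias produced from the speaker code (bias-adaptation framework)."""
    one_hot = np.eye(n_speakers)[speaker_id]   # 1-of-N speaker input
    code = one_hot @ W_code                    # continuous speaker code
    sd_bias = code @ W_adapt                   # speaker-dependent bias
    return np.maximum(0.0, x @ W_si + sd_bias) # ReLU activation

x = rng.normal(size=(feat_dim,))
h_a = hidden(x, 0)   # same frame, speaker 0
h_b = hidden(x, 1)   # same frame, speaker 1: different SD bias
```

In joint (speaker-normalized) training, gradients would flow through both the SI weights and `W_code`/`W_adapt` at once, which is the simultaneous optimization the abstract contrasts with sequential training.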
