Unsupervised speaker adaptation of batch normalized acoustic models for robust ASR

Batch normalization is a standard technique for training deep neural networks. In batch normalization, the input of each hidden layer is first mean-variance normalized and then linearly transformed before the non-linear activation function is applied. We propose a novel unsupervised speaker adaptation technique for batch normalized acoustic models. The key idea is to adjust, for every hidden layer, the linear transformation previously learned by batch normalization, using the first-pass decoding results of the speaker-independent model. With these adjusted linear transformations for each test speaker, the test-time distribution of each hidden layer's input better matches the training distribution. Experiments on the CHiME-3 dataset demonstrate the effectiveness of the proposed layer-wise adaptation approach. Our overall system obtains a 4.24% WER on the real subset of the test data, the best reported result on this dataset to date and a relative 27.3% error reduction over the previous best result.
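The mechanism can be illustrated with a minimal NumPy sketch. The `batch_norm` function below is the standard formulation (normalize with stored training statistics, then apply the learned affine transform). The "adaptation" shown here is a simplified stand-in: it rescales the affine parameters by moment matching so that one speaker's post-normalization activations match the training distribution. The paper instead learns the per-speaker adjustment by backpropagation against first-pass decoding labels; the variable names (`mu_spk`, `gamma_adapted`, etc.) are illustrative assumptions, not the authors' notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Mean-variance normalize x, then apply the learned affine transform."""
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Training-time statistics and affine parameters for one hidden layer
dim = 4
mu_train, var_train = np.zeros(dim), np.ones(dim)
gamma, beta = np.ones(dim), np.zeros(dim)

# A test speaker whose hidden-layer inputs follow a shifted, scaled distribution
x_test = rng.normal(loc=2.0, scale=3.0, size=(1000, dim))

# Unadapted: post-BN activations are far from the training distribution
y_unadapted = batch_norm(x_test, mu_train, var_train, gamma, beta)

# Simplified per-speaker adjustment of the affine parameters (moment matching
# here; the paper learns this adjustment from first-pass decoding results)
mu_spk = x_test.mean(axis=0)
std_spk = x_test.std(axis=0)
gamma_adapted = gamma / std_spk
beta_adapted = beta - gamma * mu_spk / std_spk

# Adapted: this speaker's post-BN activations again match the training
# distribution (zero mean, unit variance per dimension)
y_adapted = batch_norm(x_test, mu_train, var_train, gamma_adapted, beta_adapted)
```

Only the per-layer affine parameters change at test time; all weights, and the normalization statistics, stay frozen, which is what makes the adaptation lightweight and unsupervised.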
