Cluster Adaptive Training for Deep Neural Network Based Acoustic Model

Although context-dependent DNN-HMM systems have achieved significant improvements over GMM-HMM systems, severe performance degradation is observed when the acoustic condition of the test data mismatches that of the training data. Hence, adaptation and adaptive training of DNNs are of great research interest. Previous DNN adaptation work has mainly focused on adapting the parameters of a single DNN, either by applying linear transformations to the features or to hidden-layer outputs, or by feeding a vector representation of the non-speech variability into the input. In these methods, a large number of parameters must be estimated during adaptation. In this paper, the cluster adaptive training (CAT) framework is employed for DNN adaptive training. Multiple weight matrices are constructed to form the basis of a canonical parametric space. During adaptation, an interpolation vector is estimated for each new acoustic condition and used to combine the weight basis into a single adapted weight matrix. Since only the interpolation vector needs to be estimated during adaptation, the number of updated parameters is much smaller than in existing DNN adaptation methods. The CAT-DNN approach was evaluated on an English Switchboard task in unsupervised adaptation mode. It achieved significant WER reductions over the unadapted DNN-HMM, 7.6% to 10.6% relative, with only 10 adapted parameters.
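To make the adapted-weight construction concrete, the sketch below shows a minimal cluster-adaptive hidden layer in PyTorch. It illustrates the idea described in the abstract rather than the paper's actual implementation: the class name `CATLayer`, the layer sizes, the sigmoid activation, and the initialization are assumptions, and only the per-condition interpolation vector (10-dimensional for 10 basis matrices, matching the parameter count quoted above) is left trainable during adaptation.

```python
import torch
import torch.nn as nn


class CATLayer(nn.Module):
    """Cluster-adaptive hidden layer (illustrative sketch).

    A basis of K weight matrices spans a canonical parametric space; a
    K-dimensional interpolation vector combines them into one adapted
    weight matrix for the current acoustic condition.
    """

    def __init__(self, in_dim: int, out_dim: int, num_bases: int = 10):
        super().__init__()
        # Canonical space: K basis weight matrices plus a shared bias.
        self.bases = nn.Parameter(0.01 * torch.randn(num_bases, out_dim, in_dim))
        self.bias = nn.Parameter(torch.zeros(out_dim))
        # Interpolation vector: the only quantity updated during adaptation.
        self.interp = nn.Parameter(torch.full((num_bases,), 1.0 / num_bases))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Combine the weight basis into a single adapted weight matrix,
        # then apply it as an ordinary affine layer with a sigmoid.
        adapted_w = torch.einsum("k,koi->oi", self.interp, self.bases)
        return torch.sigmoid(x @ adapted_w.t() + self.bias)


# Adaptation to a new acoustic condition (hypothetical sizes): freeze the
# canonical basis and bias, and re-estimate only the 10-dimensional
# interpolation vector from adaptation data.
layer = CATLayer(in_dim=440, out_dim=2048, num_bases=10)
for name, p in layer.named_parameters():
    p.requires_grad = (name == "interp")
```

In a full system the basis matrices and shared bias would be trained on multi-condition data with per-cluster interpolation vectors and then frozen; unsupervised adaptation to a new condition re-estimates only the interpolation vector, for example from first-pass decoding hypotheses.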
