Factorized Hidden Layer Adaptation for Deep Neural Network Based Acoustic Modeling

In this paper, we propose the factorized hidden layer (FHL) approach to adapt the deep neural network (DNN) acoustic models for automatic speech recognition (ASR). FHL aims at modeling speaker dependent (SD) hidden layers by representing an SD affine transformation as a linear combination of bases. The combination weights are low-dimensional speaker parameters that can be initialized using speaker representations like i-vectors and then reliably refined in an unsupervised adaptation fashion. Therefore, our method provides an efficient way to perform both adaptive training and (test-time) adaptation. Experimental results have shown that the FHL adaptation improves the ASR performance significantly, compared to the standard DNN models, as well as other state-of-the-art DNN adaptation approaches, such as training with the speaker-normalized CMLLR features, speaker-aware training using i-vector and learning hidden unit contributions (LHUC). For Aurora 4, FHL achieves 3.8% and 2.3% absolute improvements over the standard DNNs trained on the LDA + STC and CMLLR features, respectively. It also achieves 1.7% absolute performance improvement over a system that combines the i-vector adaptive training with LHUC adaptation. For the AMI dataset, FHL achieved 1.4% and 1.9% absolute improvements over the sequence-trained CMLLR baseline systems, for the IHM and SDM tasks, respectively.

[1]  Kaisheng Yao,et al.  KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[2]  Kai Yu,et al.  Cluster adaptive training for deep neural network , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Li-Rong Dai,et al.  Speaker adaptation OF RNN-BLSTM for speech recognition based on speaker code , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Yifan Gong,et al.  Regularized sequence-level deep neural network model adaptation , 2015, INTERSPEECH.

[5]  Khe Chai Sim,et al.  Learning factorized feature transforms for speaker normalization , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[6]  Koichi Shinoda,et al.  Speaker adaptation of deep neural networks using a hierarchy of output layers , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[7]  John H. L. Hansen,et al.  Robust i-vector extraction for neural network adaptation in noisy environment , 2015, INTERSPEECH.

[8]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[9]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Sree Hari Krishnan Parthasarathi,et al.  Robust i-vector based adaptation of DNN acoustic model for speech recognition , 2015, INTERSPEECH.

[11]  Navdeep Jaitly,et al.  Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition , 2012, INTERSPEECH.

[12]  Jean Carletta,et al.  The AMI meeting corpus , 2005 .

[13]  Yifan Gong,et al.  Low-rank plus diagonal adaptation for deep neural networks , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Mark J. F. Gales,et al.  Multi-basis adaptive neural network for rapid adaptation in speech recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Hui Jiang,et al.  Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[16]  Thomas Niesler,et al.  The 1998 HTK system for transcription of conversational telephone speech , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[17]  S. M. Siniscalchi,et al.  Hermitian Polynomial for Speaker Adaptation of Connectionist Speech Recognition Systems , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  Themos Stafylakis,et al.  I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[20]  Steve Renals,et al.  Differentiable pooling for unsupervised speaker adaptation , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Stéphane Dupont,et al.  Fast speaker adaptation of artificial neural networks for automatic speech recognition , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[22]  Mark J. F. Gales,et al.  I-vector estimation using informative priors for adaptation of deep neural networks , 2015, INTERSPEECH.

[23]  Gerhard Rigoll,et al.  Two-stage speaker adaptation of hybrid tied-posterior acoustic models , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[24]  Mark J. F. Gales Cluster adaptive training of hidden Markov models , 2000, IEEE Trans. Speech Audio Process..

[25]  Hervé Bourlard,et al.  Continuous speech recognition using multilayer perceptrons with hidden Markov models , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[26]  Steve Renals,et al.  Hybrid acoustic models for distant and multichannel large vocabulary speech recognition , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[27]  Khe Chai Sim,et al.  Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems , 2010, INTERSPEECH.

[28]  Xiao Li,et al.  Regularized Adaptation of Discriminative Classifiers , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[29]  Kaisheng Yao,et al.  Intermediate-layer DNN adaptation for offline and session-based iterative speaker adaptation , 2015, INTERSPEECH.

[30]  George Saon,et al.  Speaker adaptation of neural network acoustic models using i-vectors , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[31]  Li Lee,et al.  Speaker normalization using efficient frequency warping procedures , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[32]  Ciro Martins,et al.  Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system , 1995, EUROSPEECH.

[33]  Yifan Gong,et al.  Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Khe Chai Sim,et al.  On combining DNN and GMM with unsupervised speaker adaptation for robust automatic speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Li-Rong Dai,et al.  Fast Adaptation of Deep Neural Network Based on Discriminant Codes for Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[36]  Steve Renals,et al.  Multi-level adaptive networks in tandem and hybrid ASR systems , 2013, ICASSP.

[37]  Pietro Laface,et al.  Adaptation of Hybrid ANN/HMM Models Using Linear Hidden Transformations and Conservative Training , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[38]  Marc Delcroix,et al.  Joint acoustic factor learning for robust deep neural network based automatic speech recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[40]  Ji Wu,et al.  Rapid adaptation for deep neural networks through multi-task learning , 2015, INTERSPEECH.

[41]  Geoffrey Zweig,et al.  An introduction to computational networks and the computational network toolkit (invited talk) , 2014, INTERSPEECH.

[42]  Andrew W. Senior,et al.  Improving DNN speaker independence with I-vector inputs , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43]  Li-Rong Dai,et al.  Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition , 2014, Journal of Signal Processing Systems.

[44]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[45]  Dong Yu,et al.  Automatic Speech Recognition: A Deep Learning Approach , 2014 .

[46]  Kaisheng Yao,et al.  Adaptation of context-dependent deep neural networks for automatic speech recognition , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[47]  Peter Bell,et al.  Structured output layer with auxiliary targets for context-dependent acoustic modelling , 2015, INTERSPEECH.

[48]  Yongqiang Wang,et al.  Adaptation of deep neural network acoustic models using factorised i-vectors , 2014, INTERSPEECH.

[49]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[50]  Khe Chai Sim,et al.  On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[51]  Yifan Gong,et al.  Restructuring of deep neural network acoustic models with singular value decomposition , 2013, INTERSPEECH.

[52]  Dong Yu,et al.  FACTORIZED DEEP NEURAL NETWORKS FOR ADAPTIVE SPEECH RECOGNITION , 2012 .

[53]  Florian Metze,et al.  Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[54]  Léon Bottou,et al.  On-line learning and stochastic approximations , 1999 .

[55]  Jeff Siu-Kei Au-Yeung,et al.  Improved performance of Aurora 4 using HTK and unsupervised MLLR adaptation , 2004, INTERSPEECH.

[56]  L. Eon Bottou Online Learning and Stochastic Approximations , 1998 .

[57]  Horacio Franco,et al.  Connectionist speaker normalization and adaptation , 1995, EUROSPEECH.

[58]  Tara N. Sainath,et al.  Deep Convolutional Neural Networks for Large-scale Speech Tasks , 2015, Neural Networks.

[59]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[60]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[61]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[62]  Yonghong Yan,et al.  A Initial Attempt on Task-Specific Adaptation for Deep Neural Network-based Large Vocabulary Continuous Speech Recognition , 2012, INTERSPEECH.

[63]  Jean Carletta,et al.  Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus , 2007, Lang. Resour. Evaluation.

[64]  Steve Renals,et al.  Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[65]  Khe Chai Sim,et al.  An investigation of augmenting speaker representations to improve speaker normalisation for DNN-based speech recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[66]  Daniel P. W. Ellis,et al.  Tandem connectionist feature extraction for conventional HMM systems , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[67]  Chengzhu Yu,et al.  Context adaptive deep neural networks for fast acoustic model adaptation in noisy conditions , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[68]  Naveen Parihar,et al.  Performance analysis of the Aurora large vocabulary baseline system , 2004, 2004 12th European Signal Processing Conference.

[69]  Khe Chai Sim,et al.  Temporally Varying Weight Regression: A Semi-Parametric Trajectory Model for Automatic Speech Recognition , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[70]  Mark J. F. Gales,et al.  Semi-tied covariance matrices for hidden Markov models , 1999, IEEE Trans. Speech Audio Process..

[71]  Jan Zelinka,et al.  Adaptation of a Feedforward Artificial Neural Network Using a Linear Transform , 2010, TSD.

[72]  Steve Renals,et al.  SAT-LHUC: Speaker adaptive training for learning hidden unit contributions , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[73]  Thomas Hain,et al.  Using neural network front-ends on far field multiple microphones based speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[74]  Hank Liao,et al.  Speaker adaptation of context dependent deep neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[75]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..