Rapid feature space MLLR speaker adaptation for deep neural network acoustic modeling

Bilinear-model-based feature space Maximum Likelihood Linear Regression (FMLLR) speaker adaptation has shown good performance for GMM-HMMs, especially when the amount of adaptation data is limited. In this paper, we propose using bilinear-model-based FMLLR features as inputs to deep neural networks (DNNs) for rapid speaker adaptation in acoustic modeling, which facilitates utterance-level normalization. The effectiveness of the proposed method is demonstrated with experiments on a Mandarin short-message dictation and voice-query dataset.
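The pipeline implied by the abstract, an utterance-level FMLLR-style affine transform applied to the acoustic features before they are passed to the DNN, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bilinear parameterization used to estimate the transform is omitted, and all function and variable names below are hypothetical.

```python
# Minimal sketch (assumption, not the authors' code): apply an
# utterance-level FMLLR-style affine transform to features before
# feeding them to a DNN acoustic model.
import numpy as np

def apply_fmllr(features, A, b):
    """Apply an affine feature-space transform x' = A x + b per frame.

    features: (num_frames, feat_dim) acoustic features for one utterance
    A:        (feat_dim, feat_dim) transform estimated from adaptation data
    b:        (feat_dim,) bias term
    """
    return features @ A.T + b

# Hypothetical usage: each utterance is normalized with its own transform,
# and the adapted features are then scored by the DNN (not shown here).
utterance = np.random.randn(300, 40)   # 300 frames of 40-dim features
A, b = np.eye(40), np.zeros(40)        # identity transform as a placeholder
adapted = apply_fmllr(utterance, A, b)
# dnn_posteriors = dnn.forward(adapted)
```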
