FMLLR Speaker Normalization With i-Vector: In Pseudo-FMLLR and Distillation Framework

When an automatic speech recognition (ASR) system is deployed for real-world applications, it often receives only one utterance at a time for decoding. Depending on the ASR task, this single utterance may be short. In such cases, robust estimation of speaker normalization methods like feature-space maximum likelihood linear regression (FMLLR) and i-vectors may not be feasible. In this paper, we propose two unsupervised speaker normalization techniques, one at the feature level and the other at the model level of acoustic modeling, to overcome the drawbacks of FMLLR and i-vectors in real-time scenarios. At the feature level, we propose the use of deep neural networks (DNNs) to generate pseudo-FMLLR features from time-synchronous pairs of filterbank and FMLLR features. These pseudo-FMLLR features can then be used for DNN acoustic model training and decoding. At the model level, we propose a generalized distillation framework, where a teacher DNN trained on FMLLR features guides the training and optimization of a student DNN trained on filterbank features. In both proposed methods, the ambiguity in choosing the speaker-specific FMLLR transform can be reduced by appending i-vectors to the input filterbank features. Experiments conducted on 33-h and 110-h subsets of the Switchboard corpus show that the proposed methods provide significant gains over DNNs trained on FMLLR, i-vector appended FMLLR, filterbank, and i-vector appended filterbank features in the real-time scenario.
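
The model-level idea can be illustrated with a per-frame training loss for the student. The sketch below is not the paper's implementation: the abstract does not state the loss, so it assumes the common generalized distillation form, in which the student's cross-entropy against the hard label is interpolated with its cross-entropy against temperature-softened teacher posteriors. The function names, the imitation weight `lam`, and the `temperature` value are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def distillation_loss(student_logits, teacher_logits, hard_label,
                      lam=0.5, temperature=2.0):
    """Per-frame student loss (assumed form, not taken from the paper).

    student_logits: student DNN outputs for a filterbank input frame
    teacher_logits: teacher DNN outputs for the time-aligned FMLLR frame
    hard_label:     index of the aligned target class for the frame
    lam:            imitation weight (hypothetical default)
    temperature:    softening applied to the teacher (hypothetical default)
    """
    num_classes = len(student_logits)
    one_hot = [1.0 if k == hard_label else 0.0 for k in range(num_classes)]
    student_post = softmax(student_logits)
    soft_target = softmax(teacher_logits, temperature)
    hard_term = cross_entropy(one_hot, student_post)
    soft_term = cross_entropy(soft_target, student_post)
    return (1.0 - lam) * hard_term + lam * soft_term
```

With `lam=0` the student is trained on hard labels only; with `lam=1` it only imitates the teacher. At decoding time only the student is needed, so no FMLLR transform has to be estimated from the single test utterance.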
