Rapid and effective speaker adaptation of convolutional neural network based models for speech recognition

Recently, we proposed a novel fast adaptation method for hybrid DNN-HMM models in speech recognition [7]. The method learns an adaptation NN that, given a suitable speaker code, transforms the input speech features of a particular speaker into a more speaker-independent space. A speaker code is learned for each speaker during adaptation, while the adaptation NN weights are learned from the whole multi-speaker training set. Our previous work showed that this method adapts DNNs effectively even when only a very small amount of adaptation data is available. However, it does not work well for convolutional neural networks (CNNs). In this paper, we investigate fast adaptation of CNN models. We first modify the speaker-code-based adaptation method to better suit the CNN structure. We then investigate a new adaptation scheme that uses speaker-specific adaptive weights on node outputs; these weights scale the outputs of different nodes to optimize the model for a new speaker. Experimental results on the TIMIT dataset demonstrate that both methods are effective at adapting CNN-based acoustic models, and that combining the two yields even better performance.
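The two schemes described above can be illustrated with a minimal NumPy sketch. All dimensions, weight shapes, and function names below are hypothetical simplifications, not the paper's actual architecture: a small adaptation layer consumes the input features concatenated with a per-speaker code, and a separate per-node scale vector multiplies the hidden-layer outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
n_feat, n_code, n_hidden = 40, 10, 64

# Adaptation NN: maps [features; speaker code] back into the feature space,
# producing a more speaker-independent representation (scheme 1).
W_adapt = rng.standard_normal((n_feat, n_feat + n_code)) * 0.01
b_adapt = np.zeros(n_feat)

# One hidden layer of the acoustic model (its weights stay frozen
# during speaker adaptation).
W_hid = rng.standard_normal((n_hidden, n_feat)) * 0.01
b_hid = np.zeros(n_hidden)

def forward(x, speaker_code, scales):
    """One adapted forward pass: feature transform, then scaled node outputs."""
    z = np.concatenate([x, speaker_code])
    x_si = np.tanh(W_adapt @ z + b_adapt)   # speaker-normalized features
    h = np.tanh(W_hid @ x_si + b_hid)       # hidden node activations
    return scales * h                       # per-node speaker scaling (scheme 2)

x = rng.standard_normal(n_feat)             # one frame of input features
code = rng.standard_normal(n_code)          # learned code for this speaker
scales = np.ones(n_hidden)                  # init to 1 = no scaling, then tuned
out = forward(x, code, scales)
```

During adaptation, only `code` (and, in the second scheme, `scales`) would be updated by backpropagation on the new speaker's data, leaving the shared acoustic-model weights untouched.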

[1] Chin-Hui Lee, et al. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Trans. Speech Audio Process.

[2] Jinyu Li, et al. Hermitian based Hidden Activation Functions for Adaptation of Hybrid HMM/ANN Models, 2012, INTERSPEECH.

[3] Philip C. Woodland, et al. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, 1995, Comput. Speech Lang.

[4] Ciro Martins, et al. Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system, 1995, EUROSPEECH.

[5] Dong Yu, et al. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, 2012, ICML.

[6] Philip C. Woodland, et al. Combined Bayesian and predictive techniques for rapid speaker adaptation of continuous density hidden Markov models, 1997, Comput. Speech Lang.

[7] Hui Jiang, et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing.

[8] Vassilios Digalakis, et al. Speaker adaptation using constrained estimation of Gaussian mixtures, 1995, IEEE Trans. Speech Audio Process.

[9] Pietro Laface, et al. Linear hidden transformations for adaptation of hybrid ANN/HMM models, 2007, Speech Commun.

[10] Hsiao-Wuen Hon, et al. Speaker-independent phone recognition using hidden Markov models, 1989, IEEE Trans. Acoust. Speech Signal Process.

[11] Nikko Ström. Automatic Continuous Speech Recognition with Rapid Speaker Adaptation for Human/Machine Interaction, 1997.

[12] Philip C. Woodland. Speaker adaptation for continuous density HMMs: a review, 2001.

[13] Yu Hu, et al. Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMs in acoustic modeling, 2012, 8th International Symposium on Chinese Spoken Language Processing.

[14] Stéphane Dupont, et al. Fast speaker adaptation of artificial neural networks for automatic speech recognition, 2000, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15] Geoffrey E. Hinton, et al. Acoustic Modeling Using Deep Belief Networks, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[16] Mark J. F. Gales. Maximum likelihood linear transformations for HMM-based speech recognition, 1998, Comput. Speech Lang.

[17] Gerald Penn, et al. Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition, 2012, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18] Li Deng, et al. A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing.