Domain adaptation of DNN acoustic models using knowledge distillation

Constructing deep neural network (DNN) acoustic models from limited training data is an important issue in developing automatic speech recognition (ASR) applications for diverse application-specific acoustic environments. To this end, domain adaptation techniques, which train a domain-matched model without overfitting by leveraging pre-constructed source models, are widely used. In this paper, we propose a novel domain adaptation method for DNN acoustic models based on the knowledge distillation framework. Knowledge distillation, originally proposed for model compression, transfers the knowledge of a teacher model to a student model and improves the student's generalizability by controlling the shape of the teacher's posterior probability distribution. We apply this framework to model adaptation: our method avoids overfitting of an adapted model trained on limited data by transferring the knowledge of the source model to the adapted model through distillation. Experiments show that the proposed method effectively avoids overfitting of convolutional neural network based acoustic models and yields lower error rates than conventional adaptation methods.
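The core mechanism described above can be sketched as a distillation objective: the student (adapted) model is trained on an interpolation of the usual cross-entropy against hard labels and a KL divergence toward the teacher (source) model's temperature-softened posteriors. The following is a minimal NumPy sketch under stated assumptions; the function and variable names, the temperature, and the interpolation weight `alpha` are illustrative placeholders, not values or APIs from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the posterior."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_targets,
                      temperature=2.0, alpha=0.5):
    """Interpolated adaptation loss (illustrative, not the paper's exact form):
    alpha * cross-entropy(hard labels) + (1 - alpha) * KL(teacher || student),
    where the KL term uses temperature-softened posteriors."""
    eps = 1e-12
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    # KL divergence from the teacher's soft posterior to the student's
    soft_loss = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1)
    # Standard cross-entropy against the hard (one-hot) targets, temperature 1
    p_hard = softmax(student_logits)
    hard_loss = -np.log(
        p_hard[np.arange(len(hard_targets)), hard_targets] + eps)
    return float(np.mean(alpha * hard_loss + (1.0 - alpha) * soft_loss))
```

Setting `alpha` toward 0 leans entirely on the source model's soft posteriors (stronger regularization against overfitting on limited adaptation data), while `alpha` toward 1 recovers ordinary fine-tuning on the target-domain labels.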
