Knowledge Distillation Using Output Errors for Self-attention End-to-end Models

Most neural network models for automatic speech recognition (ASR) are too large to deploy on mobile devices, so the model size must be reduced to fit the limited hardware resources. In this study, we investigate sequence-level knowledge distillation techniques for compressing self-attention ASR models. To mitigate the performance degradation of the compressed model, the proposed method adds an exponential weight to the sequence-level knowledge distillation loss; the weight reflects the word error rate of the teacher model's output with respect to the ground-truth word sequence. Evaluated on the LibriSpeech dataset, the proposed knowledge distillation method achieves significant improvements over the student baseline model.
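As an illustration only (not the authors' implementation), the following PyTorch sketch shows one way such an error-weighted sequence-level distillation loss could be assembled: the distillation term is computed against the teacher's decoded hypothesis and scaled by an exponential function of that hypothesis' word error rate. The weight form exp(-alpha * WER), the interpolation factor beta, and all function names here are assumptions made for the example.

# Sketch of an error-weighted sequence-level KD loss (illustrative only).
# Assumed form: the KD term is scaled by exp(-alpha * WER(hypothesis, reference));
# alpha and beta are hypothetical tuning constants, not values from the paper.
import math
import torch
import torch.nn.functional as F

def wer(hyp, ref):
    """Word error rate of hyp against ref, via Levenshtein edit distance."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(hyp)][len(ref)] / max(len(ref), 1)

def weighted_seq_kd_loss(student_logits, teacher_hyp_ids,
                         teacher_hyp_words, ref_words,
                         ce_loss, alpha=1.0, beta=0.5):
    """Interpolate the ground-truth cross-entropy loss with a sequence-level
    KD term whose weight decays exponentially with the teacher hypothesis WER.

    student_logits: student decoder outputs for the teacher hypothesis, (T, V)
    teacher_hyp_ids: token ids of the teacher's beam-search hypothesis, (T,)
    ce_loss: precomputed scalar cross-entropy loss on the ground-truth labels
    """
    # Cross-entropy of the student against the teacher's hypothesis tokens.
    kd_term = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        teacher_hyp_ids.reshape(-1))
    # Down-weight the KD term when the teacher hypothesis contains errors.
    weight = math.exp(-alpha * wer(teacher_hyp_words, ref_words))
    return (1.0 - beta) * ce_loss + beta * weight * kd_term

In this sketch the exponential weight drives the distillation term toward zero for teacher hypotheses with many word errors, so in those cases the student is trained mostly on the ground-truth labels rather than on the teacher's mistakes.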
