Paper Information - Investigation of Sequence-level Knowledge Distillation Methods for CTC Acoustic Models

This paper presents knowledge distillation (KD) methods for training connectionist temporal classification (CTC) acoustic models. In a previous study, we proposed a KD method based on sequence-level cross-entropy and showed that the conventional KD method based on frame-level cross-entropy did not work effectively for CTC acoustic models, whereas the proposed method improved their performance. In this paper, we investigate the implementation of sequence-level KD for CTC models and propose a lattice-based sequence-level KD method. Experiments on model compression and on training a noise-robust model, using the Wall Street Journal (WSJ) and CHiME4 datasets, demonstrate that the sequence-level KD methods improve the performance of CTC acoustic models on both tasks, and that the lattice-based method computes the sequence-level KD more efficiently than the N-best-based method proposed in our previous work.
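To make the distinction between the two criteria concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a frame-level cross-entropy between teacher and student posteriors with an N-best-based sequence-level cross-entropy, where each hypothesis probability is obtained with the standard CTC forward algorithm. Several things here are assumptions made for illustration: the function names, the blank index `BLANK = 0`, the toy posteriors, and the choice to derive the teacher's hypothesis weights from its CTC probabilities renormalized over the N-best list (the abstract does not state how the teacher scores are obtained, and a real system may also include decoder or language-model scores). The lattice-based variant is not shown.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_sequence_prob(frame_probs, labels):
    """P(labels | X) under CTC, computed with the forward algorithm.

    frame_probs[t][k] is the posterior of symbol k at frame t.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [BLANK]
    for lab in labels:
        ext += [lab, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][ext[0]]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]
    for t in range(1, len(frame_probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame_probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def frame_level_kd_loss(teacher_probs, student_probs):
    """Cross-entropy between teacher and student posteriors, summed over frames."""
    return -sum(
        p_t * math.log(p_s)
        for t_frame, s_frame in zip(teacher_probs, student_probs)
        for p_t, p_s in zip(t_frame, s_frame)
        if p_t > 0.0
    )


def nbest_sequence_kd_loss(teacher_probs, student_probs, nbest):
    """Cross-entropy between teacher and student over an N-best list.

    Assumption: teacher weights are its CTC sequence probabilities,
    renormalized so that they sum to one over the N-best hypotheses.
    """
    teacher_scores = [ctc_sequence_prob(teacher_probs, hyp) for hyp in nbest]
    norm = sum(teacher_scores)
    loss = 0.0
    for hyp, score in zip(nbest, teacher_scores):
        loss -= (score / norm) * math.log(ctc_sequence_prob(student_probs, hyp))
    return loss


# Toy example: two frames, two symbols (blank and one label with index 1).
teacher = [[0.5, 0.5], [0.5, 0.5]]
student = [[0.9, 0.1], [0.9, 0.1]]
hypotheses = [[1], []]  # hypothetical N-best list: "label 1" and "empty"

print(frame_level_kd_loss(teacher, student))
print(nbest_sequence_kd_loss(teacher, student, hypotheses))
```

The frame-level loss compares posteriors frame by frame, so it penalizes the student for placing a label at a different frame than the teacher even when both produce the same label sequence. The sequence-level loss sums over all alignments of each hypothesis before comparing, so it is insensitive to such timing differences.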
