An Investigation of a Knowledge Distillation Method for CTC Acoustic Models

End-to-end acoustic models, such as connectionist temporal classification (CTC) and attention-based models, have been studied extensively, and their speech recognition accuracy is approaching that of conventional deep neural network (DNN)-hidden Markov model systems. However, most high-performance end-to-end models are not suitable for real-time (streaming) speech recognition because they rely on bidirectional recurrent neural networks (RNNs). In this study, to improve the performance of a unidirectional RNN-based CTC model, which is suitable for real-time processing, we investigate knowledge distillation (KD)-based model compression for training a CTC acoustic model. We evaluate a frame-level KD method and a sequence-level KD method for the CTC model. Speech recognition experiments on the Wall Street Journal task demonstrate that frame-level KD worsens the word error rates (WERs) of the unidirectional CTC model, whereas sequence-level KD improves them.
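As a concrete illustration of the frame-level variant, the sketch below (not the authors' implementation; the PyTorch framework, the function name, and the temperature parameter are assumptions) computes a KL-divergence loss between the per-frame label posteriors of a bidirectional teacher and a unidirectional student. Sequence-level KD, by contrast, would train the student with the ordinary CTC loss on hypotheses decoded from the teacher.

```python
# Minimal sketch of a frame-level KD loss for a CTC acoustic model.
# Assumptions (not from the paper): PyTorch, teacher and student share the
# same output label set (including the CTC blank), and a softmax temperature
# is used as in Hinton et al. (2015).
import torch
import torch.nn.functional as F


def frame_level_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student per-frame posteriors.

    Both tensors have shape (batch, frames, labels).
    """
    b, t, v = student_logits.shape
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Mean KL per frame; the temperature^2 factor keeps the gradient scale
    # comparable when the temperature is changed.
    kl = F.kl_div(log_p_student.reshape(b * t, v),
                  p_teacher.reshape(b * t, v),
                  reduction="batchmean")
    return kl * temperature ** 2
```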
