Improving Knowledge Distillation of CTC-Trained Acoustic Models With Alignment-Consistent Ensemble and Target Delay

Knowledge distillation (KD) has been widely used to improve the performance of a simpler student model by imitating the outputs or intermediate representations of a more complex teacher model. The most commonly used KD technique is to minimize the Kullback-Leibler (KL) divergence between the output distributions of the teacher and student models. When it is applied to compressing acoustic models trained with the connectionist temporal classification (CTC) criterion, an assumption is made that the teacher and student share the same frame-level feature-transcription alignment. However, the frame-level alignments learned by teachers can be inaccurate and unstable because CTC training provides no fine-grained frame-level guidance. Forcing students to imitate inaccurate alignments leads to limited performance improvements. In this article, we investigate building powerful teacher models with more accurate and stable feature-transcription alignments. We achieve this goal with a novel alignment-consistent ensemble (ACE) technique, in which all models within an ensemble are trained jointly with a regularization term that encourages consistent and stable alignments. With a well-trained deep bidirectional LSTM (DBLSTM) ACE as the teacher, we can directly use the traditional frame-wise KD method to train DBLSTM students. When applying KD to transfer knowledge from a DBLSTM ACE to a deep unidirectional LSTM (DLSTM) student, a simple yet effective target delay technique is proposed to handle the alignment difference between bidirectional and unidirectional models. Experimental results on the Switchboard-I speech recognition task show that, with a DBLSTM ACE as the teacher, the simple frame-wise KD method achieves competitive or better performance than other, more complex KD methods on DBLSTM students.
When applying KD to build DLSTM students from DBLSTM teachers, our proposed target delay technique achieves relative word error rate reductions of 14.2% to 14.8% compared with models trained from scratch, outperforming other carefully designed KD methods.
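The frame-wise KD objective mentioned above is the standard per-frame KL divergence between teacher and student posteriors. A minimal sketch, assuming temperature-softened softmax outputs (the function names and the temperature value are illustrative, not taken from the paper):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over one frame's logits."""
    exps = [math.exp(x / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def frame_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Frame-wise KD: mean KL(teacher || student) over all frames.

    Each argument is a list of per-frame logit vectors; both models are
    assumed to share the same frame-level alignment, which is exactly the
    assumption the paper's ACE teacher is designed to make safe.
    """
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        p = softmax(t_frame, temperature)
        q = softmax(s_frame, temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)
```

When the student reproduces the teacher's per-frame distribution exactly, the loss is zero; any per-frame disagreement, including a purely temporal misalignment of the CTC spikes, makes it positive, which is why unstable teacher alignments hurt this objective.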
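Because a unidirectional student sees no future context, its CTC spikes naturally fire later than a bidirectional teacher's. The target delay idea can be sketched as shifting the teacher's frame-level targets a few frames later before applying frame-wise KD; the padding choice and function name below are illustrative assumptions, not the paper's exact formulation:

```python
def delay_targets(teacher_posteriors, delay, blank_posterior):
    """Shift the teacher's frame-level targets `delay` frames later.

    teacher_posteriors: list of per-frame posterior vectors from the
        bidirectional teacher.
    delay: number of frames of delay granted to the unidirectional student.
    blank_posterior: posterior vector used to pad the first `delay` frames
        (here: probability mass on the CTC blank label).
    """
    n_frames = len(teacher_posteriors)
    # Prepend `delay` padding frames and drop the last `delay` teacher
    # frames so the sequence length is unchanged.
    return [blank_posterior] * delay + teacher_posteriors[:n_frames - delay]
```

The student at frame t is then trained toward the teacher's frame t - delay distribution, giving it the extra look-ahead it needs to match the teacher's earlier, context-informed spike timings.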
