Knowledge distillation across ensembles of multilingual models for low-resource languages

This paper investigates the effectiveness of knowledge distillation in the context of multilingual models. We show that with knowledge distillation, Long Short-Term Memory (LSTM) models can be used to train standard feed-forward Deep Neural Network (DNN) models for a variety of low-resource languages. We then examine how the agreement between the teacher's best labels and the original labels affects the student model's performance. Next, we show that knowledge distillation can be readily applied to semi-supervised learning to improve model performance, and we propose a promising data selection method for filtering untranscribed data. We then turn to knowledge transfer among DNN models with multilingual features derived from CNN+DNN, LSTM, VGG, CTC, and attention models. We show that a student model equipped with better input features not only learns better from the teacher's labels but also outperforms the teacher. Further experiments suggest that, by learning from one another, the original ensemble of diverse models can evolve into a new ensemble with even better combined performance.
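As background for the student-teacher training discussed above, the sketch below shows the standard knowledge-distillation objective of Hinton et al. in PyTorch: a temperature-softened KL term against the teacher's posteriors interpolated with cross-entropy against the original hard labels. The temperature and interpolation weight are illustrative assumptions, not the settings used in this work.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Interpolate a soft-target KL loss with hard-label cross-entropy.

    temperature and alpha are hypothetical defaults for illustration only.
    """
    # Soften both the teacher and student distributions with the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence against the teacher's softened posteriors; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(log_probs, soft_targets,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the original (hard) target labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a semi-supervised setting, the hard-label term can be dropped (alpha = 1.0) for untranscribed utterances, so the student learns from the teacher's soft posteriors alone.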
