Sequence-Level Knowledge Distillation

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful at reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and we introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied to the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
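
To make the two flavors of distillation concrete, the sketch below illustrates them in PyTorch. It is not the paper's implementation: the `student`/`teacher` objects, tensor shapes, `pad_id`, the interpolation weight `alpha`, and the `teacher.beam_search` helper are illustrative assumptions. Word-level distillation trains the student to match the teacher's per-token output distribution; sequence-level distillation instead re-labels the training source side with the teacher's beam-search output and trains the student on those pairs with ordinary cross-entropy.

```python
# Illustrative sketch only (hypothetical model objects, assumed shapes):
# logits are (batch, tgt_len, vocab), gold_ids are (batch, tgt_len).
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, gold_ids,
                       alpha=0.5, pad_id=0):
    """Mix the usual NLL on gold tokens with cross-entropy against the
    teacher's soft per-token distribution (word-level distillation)."""
    vocab = student_logits.size(-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_p = F.softmax(teacher_logits, dim=-1).detach()

    # Cross-entropy of the student against the teacher's distribution,
    # averaged over non-padding target positions.
    mask = (gold_ids != pad_id).float()
    kd = -(teacher_p * student_logp).sum(-1)
    kd = (kd * mask).sum() / mask.sum()

    # Standard negative log-likelihood on the one-hot gold targets.
    nll = F.nll_loss(student_logp.reshape(-1, vocab), gold_ids.reshape(-1),
                     ignore_index=pad_id)
    return alpha * kd + (1.0 - alpha) * nll

def sequence_level_kd_data(teacher, src_sentences, beam_size=5):
    """Sequence-level distillation: replace gold targets with the teacher's
    beam-search outputs; the student is then trained on these pairs with
    ordinary cross-entropy. `teacher.beam_search` is a hypothetical helper."""
    with torch.no_grad():
        return [(src, teacher.beam_search(src, beam_size))
                for src in src_sentences]
```

Under this view, the observation that the distilled student no longer needs beam search corresponds to training on the teacher-generated data and then decoding greedily.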

References

[1] Yann LeCun, et al. Optimal Brain Damage, 1989, NIPS.

[2] Babak Hassibi, et al. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, 1992, NIPS.

[3] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[4] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[5] Franz Josef Och, et al. Minimum Error Rate Training in Statistical Machine Translation, 2003, ACL.

[6] Ben Taskar, et al. An End-to-End Discriminative Approach to Machine Translation, 2006, ACL.

[7] Rich Caruana, et al. Model compression, 2006, KDD '06.

[8] John Langford, et al. Search-based structured prediction, 2009, Machine Learning.

[9] Geoffrey J. Gordon, et al. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, 2010, AISTATS.

[10] David Chiang, et al. Hope and Fear for Discriminative Training of Statistical Translation Models, 2012, J. Mach. Learn. Res.

[11] Phil Blunsom, et al. Recurrent Continuous Translation Models, 2013, EMNLP.

[12] Misha Denil, et al. Predicting Parameters in Deep Learning, 2014.

[13] Andrew Zisserman, et al. Speeding up Convolutional Neural Networks with Low Rank Expansions, 2014, BMVC.

[14] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[15] Kai Yu, et al. Reshaping deep neural network for fast decoding by node-pruning, 2014, ICASSP.

[16] Colin Cherry, et al. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU, 2014, WMT@ACL.

[17] Yifan Gong, et al. Learning small-size DNN with output-distribution-based criteria, 2014, INTERSPEECH.

[18] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[19] Rich Caruana, et al. Do Deep Nets Really Need to be Deep?, 2013, NIPS.

[20] Joan Bruna, et al. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation, 2014, NIPS.

[21] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.

[22] Zhiyuan Liu, et al. Joint Learning of Character and Word Embeddings, 2015, IJCAI.

[23] Geoffrey E. Hinton, et al. Grammar as a Foreign Language, 2014, NIPS.

[24] David Chiang, et al. Auto-Sizing Neural Networks: With Applications to n-gram Language Models, 2015, EMNLP.

[25] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[26] Jason Weston, et al. A Neural Attention Model for Abstractive Sentence Summarization, 2015, EMNLP.

[27] Wang Ling, et al. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation, 2015, EMNLP.

[28] Nitish Srivastava, et al. Unsupervised Learning of Video Representations using LSTMs, 2015, ICML.

[29] Quoc V. Le, et al. A Neural Conversational Model, 2015, ArXiv.

[30] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.

[31] Matthew Richardson, et al. Blending LSTMs into CNNs, 2015, ICLR 2016.

[32] R. Venkatesh Babu, et al. Data-free Parameter Pruning for Deep Neural Networks, 2015, BMVC.

[33] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, 2015, NIPS.

[34] Yoshua Bengio, et al. FitNets: Hints for Thin Deep Nets, 2014, ICLR.

[35] Phil Blunsom, et al. Teaching Machines to Read and Comprehend, 2015, NIPS.

[36] Samy Bengio, et al. Show and tell: A neural image caption generator, 2015, CVPR.

[37] Quoc V. Le, et al. Listen, Attend and Spell, 2015, ArXiv.

[38] William Chan, et al. Transferring knowledge from a RNN to a DNN, 2015, INTERSPEECH.

[39] Yixin Chen, et al. Compressing Neural Networks with the Hashing Trick, 2015, ICML.

[40] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[41] Joelle Pineau, et al. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models, 2015, AAAI.

[42] Ian McGraw, et al. On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embedded speech recognition, 2016, ICASSP.

[43] Yonghui Wu, et al. Exploring the Limits of Language Modeling, 2016, ArXiv.

[44] Bo Wang, et al. SYSTRAN's Pure Neural Machine Translation Systems, 2016, ArXiv.

[45] Tara N. Sainath, et al. Learning compact recurrent neural networks, 2016, ICASSP.

[46] Suvrit Sra, et al. Diversity Networks, 2015, ICLR.

[47] José A. R. Fonollosa, et al. Character-based Neural Machine Translation, 2016, ACL.

[48] Oriol Vinyals, et al. Multilingual Language Processing From Bytes, 2015, NAACL.

[49] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.

[50] Jianfeng Gao, et al. A Diversity-Promoting Objective Function for Neural Conversation Models, 2015, NAACL.

[51] Yoshua Bengio, et al. Neural Networks with Few Multiplications, 2015, ICLR.

[52] Ran El-Yaniv, et al. Binarized Neural Networks, 2016, NIPS.

[53] Zhi Jin, et al. Distilling Word Embeddings: An Encoding Approach, 2015, CIKM.

[54] Alexander M. Rush, et al. Character-Aware Neural Language Models, 2015, AAAI.

[55] Yang Liu, et al. Minimum Risk Training for Neural Machine Translation, 2015, ACL.

[56] Wei Xu, et al. Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation, 2016, TACL.

[57] Noah A. Smith, et al. Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser, 2016, EMNLP.

[58] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.