Future-Aware Knowledge Distillation for Neural Machine Translation

Although future context is widely regarded as useful for word prediction in machine translation, it is difficult in practice to incorporate it into neural machine translation (NMT). In this paper, we propose a future-aware knowledge distillation framework (FKD) to address this issue. In the FKD framework, we learn to distill future knowledge from a backward neural language model (the teacher) into future-aware vectors (the student) during the training phase. The future-aware vector for each word position is computed by a bridge network and optimized towards the corresponding hidden state of the backward neural language model via a knowledge distillation mechanism. We further propose an algorithm to jointly train the neural machine translation model, the neural language model, and the knowledge distillation module end-to-end. The learned future-aware vectors are incorporated into the attention layer of the decoder to provide full-range context information during the decoding phase. Experiments on the NIST Chinese-English and WMT English-German translation tasks show that the proposed method significantly improves translation quality and word alignment.
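
As a rough illustration of the distillation objective described above, the sketch below (PyTorch) pairs a small bridge network with a distillation loss that pulls each future-aware vector toward the backward language model's hidden state at the same position. The bridge architecture, its input (here assumed to be decoder/encoder hidden states), the use of MSE as the distance, and all names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FutureAwareBridge(nn.Module):
    """Hypothetical bridge network: maps a hidden state at each position to a
    future-aware vector trained to mimic the corresponding hidden state of a
    backward (right-to-left) neural language model."""

    def __init__(self, hidden_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, lm_dim),
            nn.Tanh(),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, seq_len, hidden_dim) -> (batch, seq_len, lm_dim)
        return self.proj(states)


def distillation_loss(future_vectors: torch.Tensor,
                      backward_lm_states: torch.Tensor) -> torch.Tensor:
    """Knowledge-distillation term: match each future-aware vector (student)
    to the backward LM hidden state (teacher) at the same position.
    The teacher states are detached so this term does not update the LM."""
    return nn.functional.mse_loss(future_vectors, backward_lm_states.detach())
```

In a joint training setup consistent with the abstract, this distillation term would be added to the NMT cross-entropy loss and the backward language model loss, and the resulting future-aware vectors would be fed to the decoder's attention layer at decoding time; the exact weighting of the terms is not specified here and is left as an assumption.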
