Exploring Monolingual Data for Neural Machine Translation with Knowledge Distillation

We explore two types of monolingual data that can be included in knowledge distillation training for neural machine translation (NMT). The first is source-side monolingual data; the second is target-side monolingual data used as back-translation data. Both are (forward-)translated by a teacher model from the source language to the target language, and the resulting outputs are combined into a training set for smaller student models. We find that source-side monolingual data improves model performance when evaluated on test sets that originate from the source side, while target-side data has a similar positive effect on test sets in the opposite direction. We also show that the student model does not need to be trained on the same data as the teacher, as long as the domains match. Finally, we find that combining source-side and target-side data yields better performance than relying on either side of the monolingual data alone.
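The data-construction pipeline described above can be summarized in a minimal sketch (not the authors' code). The functions `teacher_translate` (source to target) and `backward_translate` (target to source) are hypothetical stand-ins for whatever teacher and back-translation models are actually used.

```python
# Minimal sketch, assuming hypothetical translation callables, of building a
# sequence-level knowledge-distillation training set from two kinds of
# monolingual data.
from typing import Callable, Iterable, List, Tuple


def build_distillation_data(
    src_monolingual: Iterable[str],
    tgt_monolingual: Iterable[str],
    teacher_translate: Callable[[List[str]], List[str]],   # source -> target (assumed)
    backward_translate: Callable[[List[str]], List[str]],  # target -> source (assumed)
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for training a smaller student model."""
    pairs: List[Tuple[str, str]] = []

    # 1) Source-side monolingual data: forward-translate with the teacher so
    #    the student learns from the teacher's outputs.
    src_sents = list(src_monolingual)
    pairs.extend(zip(src_sents, teacher_translate(src_sents)))

    # 2) Target-side monolingual data: back-translate into the source language
    #    first, then forward-translate the synthetic source with the teacher
    #    to obtain distilled targets.
    synthetic_src = backward_translate(list(tgt_monolingual))
    pairs.extend(zip(synthetic_src, teacher_translate(synthetic_src)))

    return pairs
```

Both portions of the resulting set consist of teacher outputs on the target side, which is what makes them usable for distilling a student model; the two portions differ only in the provenance of their source sentences.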
