Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation

Jiaxin Guo1, Minghan Wang1, Daimeng Wei1, Hengchao Shang1, Yuxia Wang2, Zongyao Li1, Zhengzhe Yu1, Zhanglin Wu1, Yimeng Chen1, Chang Su1, Min Zhang1, Lizhi Lei1, Shimin Tao1, Hao Yang1

1Huawei Translation Services Center, Beijing, China
2The University of Melbourne, Melbourne, Australia

{guojiaxin1,wangminghan,weidaimeng,shanghengchao,lizongyao,yuzhengzhe,wuzhanglin2,chenyimeng,suchang8,zhangmin186,leilizhi,taoshimin,yanghao30}@huawei.com
yuxiaw@student.unimelb.edu.au

Abstract
