Multi-Task Learning with Shared Encoder for Non-Autoregressive Machine Translation

Non-Autoregressive Machine Translation (NAT) models achieve significant inference speedup but suffer from inferior translation accuracy. The common practice to tackle this problem is to transfer Autoregressive Machine Translation (AT) knowledge to NAT models, e.g., via knowledge distillation. In this work, we hypothesize and empirically verify that AT and NAT encoders capture different linguistic properties of source sentences. We therefore propose to adopt multi-task learning to transfer AT knowledge to NAT models through encoder sharing. Specifically, we take the AT model as an auxiliary task to enhance NAT model performance. Experimental results on the WMT14 En-De and WMT16 En-Ro datasets show that the proposed Multi-Task NAT achieves significant improvements over baseline NAT models. Furthermore, performance on the large-scale WMT19 and WMT20 En-De datasets confirms the consistency of our method. In addition, experimental results demonstrate that Multi-Task NAT is complementary to knowledge distillation, the standard knowledge transfer method for NAT.
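
As a rough illustration of the shared-encoder multi-task setup described above, the PyTorch sketch below shares one Transformer encoder between an auxiliary autoregressive (AT) decoder and the primary non-autoregressive (NAT) decoder, and combines their cross-entropy losses. All module and parameter names here (MultiTaskNAT, multi_task_loss, the weight lambda_at) are illustrative assumptions, not the paper's actual implementation, which is built on fairseq.

import torch
import torch.nn as nn


class MultiTaskNAT(nn.Module):
    """A shared Transformer encoder feeding two decoders: an AT decoder that
    serves only as an auxiliary training task, and a NAT decoder that decodes
    all target positions in parallel."""

    def __init__(self, vocab_size=32000, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.at_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.nat_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, at_prev_tokens, nat_inputs):
        # Both branches consume the same encoder output (the shared encoder).
        memory = self.shared_encoder(self.embed(src_tokens))
        # AT branch: causal mask so each position attends only to its prefix.
        tgt_len = at_prev_tokens.size(1)
        causal = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
        at_hidden = self.at_decoder(self.embed(at_prev_tokens), memory, tgt_mask=causal)
        # NAT branch: no causal mask, all positions decoded in parallel.
        nat_hidden = self.nat_decoder(self.embed(nat_inputs), memory)
        return self.out_proj(at_hidden), self.out_proj(nat_hidden)


def multi_task_loss(at_logits, nat_logits, targets, lambda_at=0.5, pad_idx=1):
    """Joint objective: primary NAT cross-entropy plus a weighted auxiliary AT term."""
    ce = nn.CrossEntropyLoss(ignore_index=pad_idx)
    nat_loss = ce(nat_logits.reshape(-1, nat_logits.size(-1)), targets.reshape(-1))
    at_loss = ce(at_logits.reshape(-1, at_logits.size(-1)), targets.reshape(-1))
    return nat_loss + lambda_at * at_loss


# Toy training step with random placeholder data; real NAT decoder inputs are
# typically copied source embeddings or masked targets rather than the targets.
src = torch.randint(4, 32000, (2, 7))
tgt = torch.randint(4, 32000, (2, 9))
model = MultiTaskNAT()
at_logits, nat_logits = model(src, tgt, tgt)
multi_task_loss(at_logits, nat_logits, tgt).backward()

Since the AT branch only provides an auxiliary training signal, it can be discarded at inference time, so the parallel-decoding speedup of the NAT decoder is preserved.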
