Aligned Cross Entropy for Non-Autoregressive Machine Translation

Non-autoregressive machine translation models significantly speed up decoding by predicting the entire target sequence in parallel. However, modeling word order is more challenging due to the lack of autoregressive factors in the model. This difficulty is compounded during training with cross entropy loss, which can heavily penalize small shifts in word order. In this paper, we propose aligned cross entropy (AXE) as an alternative loss function for training non-autoregressive models. AXE uses a differentiable dynamic program to assign loss based on the best possible monotonic alignment between target tokens and model predictions. AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks, while setting a new state of the art for non-autoregressive models.
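
To make the dynamic program concrete, below is a minimal PyTorch sketch of an AXE-style loss. It is one plausible instantiation of a best-monotonic-alignment cross entropy, not the paper's exact recurrence: the three alignment operations (align a target token to a prediction, skip a prediction by having it emit a special blank token, let a target token share the previous prediction position), the `blank` token id, and the `axe_loss` name are all illustrative assumptions.

```python
# A minimal AXE-style sketch: cross entropy under the best monotonic
# alignment between target tokens and position-wise model predictions.
import torch

def axe_loss(log_probs: torch.Tensor, target: torch.Tensor, blank: int) -> torch.Tensor:
    """log_probs: [m, V] log-probabilities at m prediction positions.
    target: [n] target token ids. blank: id of a special epsilon token
    (an assumption) emitted by unaligned prediction positions."""
    m = log_probs.shape[0]  # number of prediction positions
    n = target.shape[0]     # target length

    # A[i][j] = lowest cost of monotonically aligning the first i target
    # tokens with the first j prediction positions.
    inf = torch.tensor(float("inf"))
    A = [[inf] * (m + 1) for _ in range(n + 1)]
    A[0][0] = torch.tensor(0.0)
    for j in range(1, m + 1):
        # Leading predictions with no target token must emit the blank.
        A[0][j] = A[0][j - 1] - log_probs[j - 1, blank]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = -log_probs[j - 1, target[i - 1]]
            A[i][j] = torch.min(torch.stack((
                A[i - 1][j - 1] + match,                 # align Y_i with position j
                A[i][j - 1] - log_probs[j - 1, blank],   # position j emits blank
                A[i - 1][j] + match,                     # Y_i shares position j
            )))
    return A[n][m]
```

Because the minimum of differentiable costs is piecewise differentiable, each table entry remains a valid autograd node: for example, `axe_loss(torch.log_softmax(torch.randn(6, 100, requires_grad=True), dim=-1), torch.tensor([7, 3, 42]), blank=0).backward()` runs, with gradients flowing through whichever alignment achieves the minimum. This is why small shifts in word order can still receive low loss: the alignment, rather than absolute position, determines which prediction each target token is scored against.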
