Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation

We propose a new training objective named order-agnostic cross entropy (OAXE) for fully non-autoregressive translation (NAT) models. OAXE improves the standard cross-entropy loss to ameliorate the effect of word reordering, which is a common source of the critical multimodality problem in NAT. Concretely, OAXE removes the penalty for word order errors and computes the cross entropy loss based on the best possible alignment between model predictions and target tokens. Since the log loss is very sensitive to invalid references, we leverage cross entropy initialization and loss truncation to ensure the model focuses on a good part of the search space. Extensive experiments on major WMT benchmarks show that OAXE substantially improves translation performance, setting a new state of the art for fully NAT models. Further analyses show that OAXE alleviates the multimodality problem by reducing token repetitions and increasing prediction confidence. Our code, data, and trained models are available at https://github.com/tencent-ailab/ICML21_OAXE.
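To make the "best possible alignment" idea concrete, the sketch below computes an order-agnostic cross entropy for a single sentence, assuming the prediction and reference sequences have equal length. The alignment that minimizes the total negative log likelihood is found with the Hungarian algorithm (here via scipy's linear_sum_assignment), and an optional drop of the highest per-token losses stands in for loss truncation. The function name, array shapes, and the truncate_ratio parameter are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def oaxe_loss(log_probs, target_ids, truncate_ratio=0.0):
    """Order-agnostic cross entropy, minimal sketch (not the reference code).

    log_probs      : (N, V) array of per-position log probabilities from the NAT decoder.
    target_ids     : length-N array of reference token ids.
    truncate_ratio : fraction of the highest per-token losses to drop,
                     a rough stand-in for loss truncation (illustrative only).
    """
    # cost[i, j] = -log P(reference token j | prediction position i)
    cost = -log_probs[:, target_ids]                  # shape (N, N)

    # Best one-to-one alignment between positions and reference tokens
    # via the Hungarian algorithm.
    rows, cols = linear_sum_assignment(cost)
    token_losses = cost[rows, cols]

    # Optionally drop the largest per-token losses so noisy or invalid
    # references do not dominate the gradient.
    if truncate_ratio > 0.0:
        keep = max(1, int(round(len(token_losses) * (1.0 - truncate_ratio))))
        token_losses = np.sort(token_losses)[:keep]

    return token_losses.mean()
```

Because word order errors incur no penalty under the best alignment, a prediction that permutes the reference tokens receives the same loss as one in the reference order; standard cross entropy would heavily penalize the former.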
