Context-Aware Cross-Attention for Non-Autoregressive Translation

Non-autoregressive translation (NAT) significantly accelerates inference by predicting the entire target sequence in parallel. However, because the decoder lacks target-side dependency modelling, the conditional generation process depends heavily on cross-attention. In this paper, we reveal a localness perception problem in NAT cross-attention, which makes it difficult to adequately capture source context. To alleviate this problem, we propose to enhance conventional cross-attention with signals from neighbouring source tokens. Experimental results on several representative datasets show that our approach consistently improves translation quality over strong NAT baselines. Extensive analyses demonstrate that the enhanced cross-attention better exploits source context by leveraging both local and global information.
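To make the idea of emphasizing neighbouring source tokens concrete, below is a minimal sketch of cross-attention with an added Gaussian locality bias over source positions. This is an illustrative assumption, not the paper's exact formulation: the function name `localness_biased_cross_attention`, the width parameter `sigma`, and the monotonic source-target alignment used to place the window centres are all hypothetical choices for the sketch.

```python
# Minimal sketch (assumed formulation, not the paper's exact method):
# cross-attention whose logits receive a Gaussian bias that favours
# source tokens near an assumed monotonically aligned position.
import torch
import torch.nn.functional as F


def localness_biased_cross_attention(queries, keys, values, sigma=1.0):
    """Scaled dot-product cross-attention with a Gaussian locality bias.

    queries: (tgt_len, d_model) decoder states
    keys, values: (src_len, d_model) encoder states
    sigma: width of the Gaussian window over source positions (hyperparameter)
    """
    tgt_len, d_model = queries.shape
    src_len = keys.shape[0]

    # Standard cross-attention logits.
    logits = queries @ keys.transpose(0, 1) / d_model ** 0.5  # (tgt_len, src_len)

    # Assumption: target position i attends around the proportionally scaled
    # source position c_i = i * src_len / tgt_len (monotonic alignment).
    centers = torch.arange(tgt_len, dtype=torch.float32) * src_len / tgt_len
    positions = torch.arange(src_len, dtype=torch.float32)
    # Bias is near zero at the centre and increasingly negative farther away,
    # so neighbouring source tokens receive larger attention weights.
    bias = -((positions.unsqueeze(0) - centers.unsqueeze(1)) ** 2) / (2 * sigma ** 2)

    weights = F.softmax(logits + bias, dim=-1)   # local signal folded into attention
    return weights @ values                      # (tgt_len, d_model)


# Usage: 6 decoder positions attending over 8 encoder states of width 16.
q = torch.randn(6, 16)
k = torch.randn(8, 16)
v = torch.randn(8, 16)
out = localness_biased_cross_attention(q, k, v, sigma=2.0)
print(out.shape)  # torch.Size([6, 16])
```

Because the bias is added to the logits rather than replacing them, the softmax still distributes some probability mass globally, so the sketch keeps both local and global source information, in the spirit of the enhanced cross-attention described above.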
