Self-Attention with Relative Position Representations
References

[1] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.
[2] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.
[3] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.
[4] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Alex Graves, et al. Neural Machine Translation in Linear Time, 2016, ArXiv.
[6] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.
[7] Pietro Liò, et al. Graph Attention Networks, 2017, ICLR.
[8] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[9] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.
[10] Jakob Uszkoreit, et al. A Decomposable Attention Model for Natural Language Inference, 2016, EMNLP.
[11] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[12] Yann Dauphin, et al. Convolutional Sequence to Sequence Learning, 2017, ICML.
[13] Jason Weston, et al. End-To-End Memory Networks, 2015, NIPS.
[14] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.