How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures

With recent advances in network architectures for Neural Machine Translation (NMT), recurrent models have effectively been replaced by either convolutional or self-attentional approaches, such as the Transformer. While the main innovation of the Transformer architecture is its use of self-attentional layers, several other aspects, such as attention with multiple heads and the use of many attention layers, also distinguish the model from previous baselines. In this work, we take a fine-grained look at the different architectures for NMT. We introduce an Architecture Definition Language (ADL) that allows for a flexible combination of common building blocks. Using this language, we show experimentally that recurrent and convolutional models can be brought very close to Transformer performance by borrowing concepts from the Transformer architecture without using self-attention. Additionally, we find that self-attention is much more important on the encoder side than on the decoder side, where it can be replaced by an RNN or CNN without a loss in performance in most settings. Surprisingly, even a model without any target-side self-attention performs well.
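
To make the building block under discussion concrete, the following is a minimal NumPy sketch of multi-head scaled dot-product self-attention, the core layer of the Transformer whose importance the abstract analyzes. It is an illustrative sketch only: the function name, weight shapes, and the toy usage example are assumptions for exposition and do not reproduce the paper's ADL or its experimental setup.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with multiple heads.

    x:                  (seq_len, d_model) input sequence
    w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices (illustrative)
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project inputs to queries, keys, and values, then split into heads.
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Per-head attention: softmax(Q K^T / sqrt(d_head)) V
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (heads, seq, seq)
    context = softmax(scores, axis=-1) @ v                  # (heads, seq, d_head)

    # Concatenate heads and apply the output projection.
    concat = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Tiny usage example with random weights (illustrative only).
rng = np.random.default_rng(0)
d_model, heads, seq = 8, 2, 5
x = rng.standard_normal((seq, d_model))
w = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_self_attention(x, *w, num_heads=heads)
print(out.shape)  # (5, 8)

In the encoder/decoder comparison described above, it is this layer that can, on the decoder side, be swapped for a recurrent or convolutional block with little loss in performance.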
