Weighted Transformer Network for Machine Translation

State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recurrence. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose the Weighted Transformer, a Transformer with modified attention layers that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention with multiple self-attention branches that the model learns to combine during training. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 BLEU points on the English-to-French translation task.
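
To make the branched-attention idea concrete, the sketch below replaces the concatenation step of standard multi-head attention with a learned, softmax-normalized weighted sum over branches. This is a minimal PyTorch sketch under assumed details (shared query/key/value projections split across branches, one output projection per branch, a softmax over one learned logit per branch); the class name BranchedSelfAttention and all hyper-parameters are illustrative and not the authors' reference implementation.

```python
# Minimal sketch (not the authors' code): branched self-attention whose
# branch outputs are mixed by learned, softmax-normalized weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchedSelfAttention(nn.Module):
    """Hypothetical stand-in for the weighted/branched attention layer."""

    def __init__(self, d_model: int, num_branches: int = 8):
        super().__init__()
        assert d_model % num_branches == 0
        self.num_branches = num_branches
        self.d_head = d_model // num_branches
        # Shared Q/K/V projections, split into per-branch slices as in
        # standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One output projection per branch back to the model dimension.
        self.out_proj = nn.ModuleList(
            [nn.Linear(self.d_head, d_model) for _ in range(num_branches)]
        )
        # Learned branch-combination logits; the softmax below keeps the
        # mixing weights positive and summing to one.
        self.branch_logits = nn.Parameter(torch.zeros(num_branches))

    def _split(self, z: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, branches, seq, d_head)
        b, t, _ = z.shape
        return z.view(b, t, self.num_branches, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self._split(self.q_proj(x))
        k = self._split(self.k_proj(x))
        v = self._split(self.v_proj(x))
        # Scaled dot-product attention computed independently per branch.
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = torch.matmul(F.softmax(scores, dim=-1), v)  # (batch, branches, seq, d_head)
        # Weighted sum over branches instead of concatenation.
        weights = F.softmax(self.branch_logits, dim=0)
        return sum(
            weights[i] * self.out_proj[i](attn[:, i]) for i in range(self.num_branches)
        )


# Example usage:
# layer = BranchedSelfAttention(d_model=512, num_branches=8)
# out = layer(torch.randn(2, 10, 512))  # -> (2, 10, 512)
```

In a full encoder or decoder layer, the output of such a block would feed the position-wise feed-forward sublayer as in the baseline Transformer; the branch weights are trained jointly with the rest of the network.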

[1] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.

[2] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[3] Xavier Gastaldi, et al. Shake-Shake Regularization, 2017, ArXiv.

[4] Wei Xu, et al. Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation, 2016, TACL.

[5] Richard Socher, et al. A Deep Reinforced Model for Abstractive Summarization, 2017, ICLR.

[6] Samy Bengio, et al. Can Active Memory Replace Attention?, 2016, NIPS.

[7] Jakob Uszkoreit, et al. A Decomposable Attention Model for Natural Language Inference, 2016, EMNLP.

[8] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[9] Lior Wolf, et al. Using the Output Embedding to Improve Language Models, 2016, EACL.

[10] Chris Dyer, et al. On the State of the Art of Evaluation in Neural Language Models, 2017, ICLR.

[11] Rico Sennrich, et al. Deep Architectures for Neural Machine Translation, 2017, WMT.

[12] Geoffrey Zweig, et al. The Microsoft 2016 Conversational Speech Recognition System, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.

[14] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[15] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[16] Yoshua Bengio, et al. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies, 2001.

[17] Hakan Inan, et al. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling, 2016, ICLR.

[18] Yu Zhang, et al. Training RNNs as Fast as CNNs, 2017, EMNLP 2018.

[19] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.

[20] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[21] Sepp Hochreiter, et al. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions, 1998, Int. J. Uncertain. Fuzziness Knowl. Based Syst.

[22] Yann Dauphin, et al. Convolutional Sequence to Sequence Learning, 2017, ICML.

[23] Yann Dauphin, et al. A Convolutional Encoder Model for Neural Machine Translation, 2016, ACL.

[24] Richard Socher, et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.

[25] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Lorenzo Torresani, et al. BranchConnect: Large-Scale Visual Recognition with Learned Branch Connections, 2017, ArXiv.

[27] Richard Socher, et al. Quasi-Recurrent Neural Networks, 2016, ICLR.

[28] Lorenzo Torresani, et al. BranchConnect: Image Categorization with Learned Branch Connections, 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[29] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[30] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[31] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.

[32] Geoffrey E. Hinton, et al. Speech Recognition with Deep Recurrent Neural Networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33] Bowen Zhou, et al. A Structured Self-attentive Sentence Embedding, 2017, ICLR.

[34] Alex Graves, et al. Neural Machine Translation in Linear Time, 2016, ArXiv.

[35] Zhuowen Tu, et al. Aggregated Residual Transformations for Deep Neural Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Alexander M. Rush, et al. Structured Attention Networks, 2017, ICLR.

[37] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.