Weighted Transformer Network for Machine Translation

State-of-the-art results on neural machine translation often use attentional sequence-to-sequence models with some form of convolution or recurrence. Vaswani et al. (2017) propose a new architecture that avoids recurrence and convolution completely. Instead, it uses only self-attention and feed-forward layers. While the proposed architecture achieves state-of-the-art results on several machine translation tasks, it requires a large number of parameters and training iterations to converge. We propose the Weighted Transformer, a Transformer with modified attention layers that not only outperforms the baseline network in BLEU score but also converges 15-40% faster. Specifically, we replace the multi-head attention with multiple self-attention branches that the model learns to combine during training. Our model improves the state-of-the-art performance by 0.5 BLEU points on the WMT 2014 English-to-German translation task and by 0.4 BLEU points on the English-to-French translation task.
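
To make the branched-attention idea concrete, the sketch below replaces the concatenation step of standard multi-head attention with a learned, softmax-normalized weighted sum over branches. This is a minimal PyTorch sketch under assumed details (shared query/key/value projections split across branches, one output projection per branch, a softmax over one learned logit per branch); the class name BranchedSelfAttention and all hyper-parameters are illustrative and not the authors' reference implementation.

```python
# Minimal sketch (not the authors' code): branched self-attention whose
# branch outputs are mixed by learned, softmax-normalized weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchedSelfAttention(nn.Module):
    """Hypothetical stand-in for the weighted/branched attention layer."""

    def __init__(self, d_model: int, num_branches: int = 8):
        super().__init__()
        assert d_model % num_branches == 0
        self.num_branches = num_branches
        self.d_head = d_model // num_branches
        # Shared Q/K/V projections, split into per-branch slices as in
        # standard multi-head attention.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One output projection per branch back to the model dimension.
        self.out_proj = nn.ModuleList(
            [nn.Linear(self.d_head, d_model) for _ in range(num_branches)]
        )
        # Learned branch-combination logits; the softmax below keeps the
        # mixing weights positive and summing to one.
        self.branch_logits = nn.Parameter(torch.zeros(num_branches))

    def _split(self, z: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, branches, seq, d_head)
        b, t, _ = z.shape
        return z.view(b, t, self.num_branches, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q = self._split(self.q_proj(x))
        k = self._split(self.k_proj(x))
        v = self._split(self.v_proj(x))
        # Scaled dot-product attention computed independently per branch.
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        attn = torch.matmul(F.softmax(scores, dim=-1), v)  # (batch, branches, seq, d_head)
        # Weighted sum over branches instead of concatenation.
        weights = F.softmax(self.branch_logits, dim=0)
        return sum(
            weights[i] * self.out_proj[i](attn[:, i]) for i in range(self.num_branches)
        )


# Example usage:
# layer = BranchedSelfAttention(d_model=512, num_branches=8)
# out = layer(torch.randn(2, 10, 512))  # -> (2, 10, 512)
```

In a full encoder or decoder layer, the output of such a block would feed the position-wise feed-forward sublayer as in the baseline Transformer; the branch weights are trained jointly with the rest of the network.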

[1] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation, 2015, EMNLP.

[2] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[3] Xavier Gastaldi, et al. Shake-Shake Regularization, 2017, ArXiv.

[4] Wei Xu, et al. Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation, 2016, TACL.

[5] Richard Socher, et al. A Deep Reinforced Model for Abstractive Summarization, 2017, ICLR.

[6] Samy Bengio, et al. Can Active Memory Replace Attention?, 2016, NIPS.

[7] Jakob Uszkoreit, et al. A Decomposable Attention Model for Natural Language Inference, 2016, EMNLP.

[8] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[9] Lior Wolf, et al. Using the Output Embedding to Improve Language Models, 2016, EACL.

[10] Chris Dyer, et al. On the State of the Art of Evaluation in Neural Language Models, 2017, ICLR.

[11] Rico Sennrich, et al. Deep Architectures for Neural Machine Translation, 2017, WMT.

[12] Geoffrey Zweig, et al. The Microsoft 2016 Conversational Speech Recognition System, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.

[14] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, 2014, ICLR.

[15] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[16] Yoshua Bengio, et al. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies, 2001.

[17] Hakan Inan, et al. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling, 2016, ICLR.

[18] Yu Zhang, et al. Training RNNs as Fast as CNNs, 2017, EMNLP 2018.

[19] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.

[20] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[21] Sepp Hochreiter, et al. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions, 1998, Int. J. Uncertain. Fuzziness Knowl. Based Syst.

[22] Yann Dauphin, et al. Convolutional Sequence to Sequence Learning, 2017, ICML.

[23] Yann Dauphin, et al. A Convolutional Encoder Model for Neural Machine Translation, 2016, ACL.

[24] Richard Socher, et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.

[25] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Lorenzo Torresani, et al. BranchConnect: Large-Scale Visual Recognition with Learned Branch Connections, 2017, ArXiv.

[27] Richard Socher, et al. Quasi-Recurrent Neural Networks, 2016, ICLR.

[28] Lorenzo Torresani, et al. BranchConnect: Image Categorization with Learned Branch Connections, 2018, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[29] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.

[30] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[31] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.

[32] Geoffrey E. Hinton, et al. Speech Recognition with Deep Recurrent Neural Networks, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[33] Bowen Zhou, et al. A Structured Self-attentive Sentence Embedding, 2017, ICLR.

[34] Alex Graves, et al. Neural Machine Translation in Linear Time, 2016, ArXiv.

[35] Zhuowen Tu, et al. Aggregated Residual Transformations for Deep Neural Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Alexander M. Rush, et al. Structured Attention Networks, 2017, ICLR.

[37] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.