Training Deeper Neural Machine Translation Models with Transparent Attention

While current state-of-the-art NMT models, such as RNN seq2seq and Transformers, possess a large number of parameters, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper Transformer and Bi-RNN encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT’14 English-German and WMT’15 Czech-English tasks for both architectures.
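The modification referred to here is "transparent attention": instead of attending only to the top encoder layer, each decoder layer attends to a learned, softmax-weighted combination of all encoder layer outputs, so gradients flow directly into every encoder layer during training. Below is a minimal NumPy sketch of that weighting step under stated assumptions; the function names, variable names, and shapes are illustrative, not the authors' exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transparent_encoder_memories(encoder_layer_outputs, layer_logits):
    """Combine all encoder layer outputs into one source memory per decoder layer.

    encoder_layer_outputs: list of (seq_len, d_model) arrays, one per encoder
        layer (the embedding layer may be included as an extra entry).
    layer_logits: (num_decoder_layers, num_encoder_outputs) learned scalars;
        a softmax over the encoder-output axis gives the mixing weights.
    Returns: (num_decoder_layers, seq_len, d_model) array, i.e. the weighted
        combination of encoder outputs that decoder layer j attends to.
    """
    stacked = np.stack(encoder_layer_outputs)           # (L_enc, seq, d)
    weights = softmax(layer_logits, axis=-1)            # (L_dec, L_enc)
    return np.einsum('ji,isd->jsd', weights, stacked)   # (L_dec, seq, d)

# Illustrative shapes only: 6 encoder layers plus embeddings, 6 decoder
# layers, source length 10, model width 512. In practice layer_logits
# would be trainable parameters, not zeros.
enc_outs = [np.random.randn(10, 512) for _ in range(7)]
logits = np.zeros((6, 7))
memories = transparent_encoder_memories(enc_outs, logits)
print(memories.shape)                                   # (6, 10, 512)
```

With uniform (zero) logits this reduces to a simple average over encoder layers; training the logits lets each decoder layer pick its own mixture, which is what eases optimization of the deeper encoders described above.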
