Training Deeper Neural Machine Translation Models with Transparent Attention

While current state-of-the-art NMT models, such as RNN seq2seq and Transformers, possess a large number of parameters, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper Transformer and Bi-RNN encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT’14 English-German and WMT’15 Czech-English tasks for both architectures.
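The modification referred to here is "transparent attention": instead of attending only to the top encoder layer, each decoder layer attends to a learned, softmax-weighted combination of all encoder layer outputs, so gradients flow directly into every encoder layer during training. Below is a minimal NumPy sketch of that weighting step under stated assumptions; the function names, variable names, and shapes are illustrative, not the authors' exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transparent_encoder_memories(encoder_layer_outputs, layer_logits):
    """Combine all encoder layer outputs into one source memory per decoder layer.

    encoder_layer_outputs: list of (seq_len, d_model) arrays, one per encoder
        layer (the embedding layer may be included as an extra entry).
    layer_logits: (num_decoder_layers, num_encoder_outputs) learned scalars;
        a softmax over the encoder-output axis gives the mixing weights.
    Returns: (num_decoder_layers, seq_len, d_model) array, i.e. the weighted
        combination of encoder outputs that decoder layer j attends to.
    """
    stacked = np.stack(encoder_layer_outputs)           # (L_enc, seq, d)
    weights = softmax(layer_logits, axis=-1)            # (L_dec, L_enc)
    return np.einsum('ji,isd->jsd', weights, stacked)   # (L_dec, seq, d)

# Illustrative shapes only: 6 encoder layers plus embeddings, 6 decoder
# layers, source length 10, model width 512. In practice layer_logits
# would be trainable parameters, not zeros.
enc_outs = [np.random.randn(10, 512) for _ in range(7)]
logits = np.zeros((6, 7))
memories = transparent_encoder_memories(enc_outs, logits)
print(memories.shape)                                   # (6, 10, 512)
```

With uniform (zero) logits this reduces to a simple average over encoder layers; training the logits lets each decoder layer pick its own mixture, which is what eases optimization of the deeper encoders described above.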
