Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that the poor convergence stems from vanishing gradients, caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage and thereby reduces the output variance of residual connections, easing gradient back-propagation through normalization layers. To address the computational cost, we propose a merged attention sublayer (MAtt), which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks across five translation directions show that deep Transformers with DS-Init and MAtt substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt. Source code for reproduction will be released soon.
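To make the initialization idea concrete, here is a minimal NumPy sketch of depth-scaled initialization. It assumes DS-Init shrinks the standard Xavier uniform bound of the l-th layer by a factor of alpha / sqrt(l), with alpha a hyperparameter; the helper names `xavier_bound` and `ds_init` are illustrative, not taken from the released code.

```python
import math
import numpy as np

def xavier_bound(fan_in: int, fan_out: int) -> float:
    """Standard Xavier/Glorot uniform bound: sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def ds_init(fan_in: int, fan_out: int, layer: int, alpha: float = 1.0,
            seed: int | None = None) -> np.ndarray:
    """Sample a weight matrix for the `layer`-th block (1-indexed), scaling
    the Xavier bound by alpha / sqrt(layer). Deeper blocks start with smaller
    weights, which keeps the output variance of their residual branches low
    and eases gradient back-propagation through the normalization layers."""
    rng = np.random.default_rng(seed)
    bound = xavier_bound(fan_in, fan_out) * alpha / math.sqrt(layer)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

# Example: block 12 starts with a bound 1/sqrt(12) times the Xavier bound.
w = ds_init(512, 2048, layer=12)
```

Because only the initial variance changes, the architecture and training recipe stay standard; the scaling simply damps how much each deep residual branch perturbs the normalized signal at the start of training.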

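The merged attention sublayer can likewise be sketched in a few lines. This is one plausible reading of the abstract, under the assumption that the two branches are summed inside a single residual-plus-layer-norm block and with the learned linear projections omitted for brevity: the simplified average-based self-attention reduces to a cumulative mean over past positions, run in parallel with ordinary encoder-decoder attention.

```python
import numpy as np

def cumulative_average(x: np.ndarray) -> np.ndarray:
    """Simplified average-based self-attention: position t attends uniformly
    to positions 1..t, i.e. y_t = mean(x_1, ..., x_t). Shape (T, d) -> (T, d)."""
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return np.cumsum(x, axis=0) / counts

def attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Plain single-head scaled dot-product attention (no mask needed here:
    the decoder may attend to the full encoder output)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-position layer normalization (learned gain/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def merged_attention(x: np.ndarray, enc: np.ndarray) -> np.ndarray:
    """Hypothetical MAtt sublayer: the average-attention branch replaces
    masked self-attention and is summed with the cross-attention branch
    inside one residual + layer-norm block."""
    self_branch = cumulative_average(x)    # O(1) state per step when decoding
    cross_branch = attention(x, enc, enc)  # attend to encoder states
    return layer_norm(x + self_branch + cross_branch)
```

The decoding-time efficiency gain comes from the average branch: it needs only a running sum per generated token instead of attending over all previous decoder states, and merging it with cross-attention removes one of the decoder's three sublayers.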