An Efficient Transformer Decoder with Compressed Sub-layers

The large attention-based encoder-decoder network (Transformer) has recently become prevalent due to its effectiveness. However, the high computational complexity of its decoder makes it inefficient. By examining the mathematical formulation of the decoder, we show that, under some mild conditions, the architecture can be simplified by compressing its sub-layers, the basic building blocks of the Transformer, to achieve higher parallelism. We thereby propose the Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42× faster than a strong baseline while matching its performance; this strong baseline is itself already 2× faster than the widely used standard baseline with no loss in performance. The code is publicly available at https://github.com/Lollipop321/compressed-attention.
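The abstract does not spell out how the three sub-layers are compressed into one, but a minimal sketch of the general idea, written here under our own assumptions, is shown below: a single attention sub-layer whose keys and values span both the decoder states and the encoder memory, so that self-attention, cross-attention, and the separate feed-forward sub-layer no longer appear as three sequential blocks. The class name, the fusion scheme, and the omission of causal masking are illustrative assumptions, not the paper's exact Compressed Attention Network formulation.

# Illustrative sketch only: a decoder layer with one fused attention
# sub-layer in place of the usual self-attention + cross-attention + FFN
# stack. The fusion scheme and class name are assumptions for illustration,
# not the paper's method; causal masking is omitted for brevity.
import torch
import torch.nn as nn


class CompressedDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # One attention call whose keys/values cover both the decoder states
        # and the encoder memory, so two attention sub-layers collapse into one.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        kv = torch.cat([tgt, memory], dim=1)  # (batch, tgt_len + src_len, d_model)
        out, _ = self.attn(tgt, kv, kv)
        return self.norm(tgt + out)           # residual connection + LayerNorm


if __name__ == "__main__":
    layer = CompressedDecoderLayer()
    tgt = torch.randn(2, 7, 512)      # (batch, tgt_len, d_model)
    memory = torch.randn(2, 11, 512)  # (batch, src_len, d_model)
    print(layer(tgt, memory).shape)   # torch.Size([2, 7, 512])

Collapsing three sequential sub-layers into one removes two residual/normalization boundaries per decoder layer, which is one plausible reading of where the higher parallelism claimed in the abstract comes from.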
