On the Sub-Layer Functionalities of Transformer Decoder
Yilin Yang | Longyue Wang | Shuming Shi | Prasad Tadepalli | Stefan Lee | Zhaopeng Tu