On the Sub-Layer Functionalities of Transformer Decoder

There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); the decoder, meanwhile, remains largely unexamined despite its critical role. During translation, the decoder must predict each output token by considering both the source-language representations from the encoder and the target-language prefix produced in previous steps. In this work, we study how Transformer-based decoders leverage information from the source and target languages -- developing a universal probing task to assess how information propagates through each module of each decoder layer. We perform extensive experiments on three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis provides insight into when and where decoders leverage source and target information. Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance -- a significant reduction in computation and parameter count, and consequently a significant boost to both training and inference speed.
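
To make the ablation concrete, the sketch below shows a minimal Transformer decoder layer in which the position-wise feed-forward sub-layer can be switched off. This is not the authors' implementation; it is an illustrative PyTorch sketch, and the module names, hyper-parameters, and the `use_ffn` flag are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Illustrative decoder layer: self-attention, cross-attention, optional FFN."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, use_ffn=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.use_ffn = use_ffn  # set False to drop the residual feed-forward module
        if use_ffn:
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm3 = nn.LayerNorm(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, memory, tgt_mask=None):
        # Masked self-attention over the target-language prefix.
        a, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + a)
        # Cross-attention over the encoder's source-language representations.
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + a)
        # Residual feed-forward sub-layer; skipped entirely when use_ffn=False.
        if self.use_ffn:
            x = self.norm3(x + self.ffn(x))
        return x
```

Dropping the feed-forward block removes the two `d_model x d_ff` projections per layer, which is where the bulk of a decoder layer's parameters and FLOPs reside; this is the source of the speed-up claimed above.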
