Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

The Transformer architecture has become ubiquitous in natural language processing. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not composed solely of multi-head attention; other components can also contribute to Transformers' strong performance. In this study, we extend the scope of Transformer analysis from the attention patterns alone to the whole attention block, i.e., multi-head attention, the residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new, intuitive explanations for existing observations; for example, discarding the learned attention patterns tends not to hurt performance. The code for our experiments is publicly available.
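
To make the idea of analyzing the whole attention block concrete, here is a minimal sketch (not the authors' exact norm-based decomposition) that builds a post-LN attention block out of a randomly initialized multi-head attention, a residual connection, and layer normalization, and compares the size of the attention-derived update with the residual stream it is added to. The toy dimensions and the comparison of vector norms are illustrative assumptions; the study itself analyzes pretrained masked language models such as BERT.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration; the actual analysis targets BERT/RoBERTa-scale models.
d_model, n_heads, seq_len = 64, 4, 5

attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ln = nn.LayerNorm(d_model)

# Hidden states entering the attention block (batch of 1).
x = torch.randn(1, seq_len, d_model)

# Multi-head attention output: the part of the block that mixes information across tokens.
attn_out, _ = attn(x, x, x)

# Full attention block: multi-head attention + residual connection + layer normalization
# (post-LN ordering, as in BERT).
block_out = ln(attn_out + x)

# Compare the attention-derived update with the residual stream itself. If the residual
# dominates, token-to-token mixing changes the representation less than attention maps
# alone would suggest.
mixing_norm = attn_out.norm(dim=-1).mean()
residual_norm = x.norm(dim=-1).mean()
ratio = (mixing_norm / residual_norm).item()
print(f"||attention update|| / ||residual|| = {ratio:.3f}")
```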
