What Does BERT Look at? An Analysis of BERT’s Attention

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.
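
To make the head-level analysis concrete, the following is a minimal sketch (not the authors' released code) of how per-head attention maps can be extracted and summarized. It assumes the Hugging Face `transformers` and `torch` packages and measures how much attention each of BERT's heads places on the [SEP] delimiter token, one of the patterns described above.

```python
# A hypothetical sketch of head-level attention analysis, assuming the
# Hugging Face `transformers` library: extract per-head attention maps from
# BERT and measure each head's average attention mass on [SEP] tokens.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The chef who ran to the store was out of food."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer (12 for bert-base),
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sep_positions = [i for i, t in enumerate(tokens) if t == "[SEP]"]

# For each head, sum the attention placed on [SEP] tokens and average over
# query positions, yielding a (layers, heads) summary of the pattern.
sep_mass = attentions[:, :, :, sep_positions].sum(-1).mean(-1)
for layer in range(sep_mass.size(0)):
    print(f"layer {layer + 1} attention to [SEP] per head:",
          [f"{m:.2f}" for m in sep_mass[layer].tolist()])
```

The same loop generalizes to other per-head statistics, e.g., attention to a fixed positional offset or agreement between a head's most-attended token and gold dependency or coreference links.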
