Attention Understands Semantic Relations

Today, natural language processing heavily relies on pre-trained large language models. Even though such models are criticized for their poor interpretability, they still yield state-of-the-art solutions for a wide range of very different tasks. While many probing studies have been conducted to measure models' awareness of grammatical knowledge, semantic probing is less common. In this work, we introduce a probing pipeline to study how well semantic relations are represented in transformer language models. We show that in this task, attention scores are nearly as expressive as the layers' output activations, despite being less able to represent surface cues. This supports the hypothesis that attention mechanisms focus not only on syntactic relational information but also on semantic relational information.

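As a rough illustration of the kind of probing described above, the sketch below extracts either per-head attention scores or hidden-state activations for an entity pair from a pretrained encoder and fits a linear probe to predict the relation label. This is a minimal sketch, not the paper's actual pipeline: the model name, probed layer, feature construction, and toy examples are all illustrative assumptions.

```python
# Sketch of a relation-probing setup: build features for an entity pair from either
# attention scores or hidden-state activations, then train a linear probe on them.
# Model choice, layer choice, feature construction, and data are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "bert-base-uncased"  # assumption: any transformer encoder could be probed
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(
    MODEL_NAME, output_attentions=True, output_hidden_states=True
)
model.eval()

LAYER = 8  # assumption: probe a single middle layer


def pair_features(sentence, head_word, tail_word, use_attention):
    """Build a feature vector for the (head, tail) entity pair in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    # Naive lookup of the first subword of each entity (real pipelines align spans).
    i = tokens.index(tokenizer.tokenize(head_word)[0])
    j = tokens.index(tokenizer.tokenize(tail_word)[0])
    if use_attention:
        # Attention scores between the two entity tokens, one value per head,
        # in both directions.
        att = out.attentions[LAYER][0]      # (heads, seq, seq)
        feats = torch.cat([att[:, i, j], att[:, j, i]])
    else:
        # Concatenated hidden-state activations of the two entity tokens.
        hid = out.hidden_states[LAYER][0]   # (seq, hidden)
        feats = torch.cat([hid[i], hid[j]])
    return feats.numpy()


# Toy training set (hypothetical): sentence, head entity, tail entity, relation label.
data = [
    ("Paris is the capital of France.", "Paris", "France", "capital_of"),
    ("Berlin is the capital of Germany.", "Berlin", "Germany", "capital_of"),
    ("Marie Curie was born in Warsaw.", "Curie", "Warsaw", "born_in"),
    ("Alan Turing was born in London.", "Turing", "London", "born_in"),
]

for use_attention in (True, False):
    X = [pair_features(s, h, t, use_attention) for s, h, t, _ in data]
    y = [label for *_, label in data]
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    kind = "attention scores" if use_attention else "activations"
    print(f"{kind}: train accuracy = {probe.score(X, y):.2f}")
```

Comparing the two probes on held-out relation instances is what would indicate whether attention scores carry nearly as much relational signal as the layer activations; the toy data here is only meant to show the feature-extraction mechanics.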