Kformer: Knowledge Injection in Transformer Feed-Forward Layers

Knowledge-enhanced models have developed a diverse set of techniques for integrating knowledge from different knowledge sources. However, most previous work neglects the language model's own ability and simply concatenates external knowledge at the input. Recent work has proposed that the feed-forward network (FFN) in pre-trained language models can be viewed as a memory that stores factual knowledge. In this work, we explore the FFN in the Transformer and propose a novel knowledge fusion model, Kformer, which incorporates external knowledge through the Transformer's feed-forward layers. We empirically find that simply injecting knowledge into the FFN can enhance the pre-trained language model's ability and facilitate current knowledge fusion methods. Our results on two benchmarks, commonsense reasoning (SocialIQA) and medical question answering (MedQA-USMLE), demonstrate that Kformer can exploit external knowledge deeply and achieves absolute improvements on these tasks.
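The abstract only states that knowledge is injected through the feed-forward layers. The sketch below illustrates one plausible reading under the key-value-memory view of FFNs: top-N retrieved knowledge sentences are embedded, projected, and appended as extra key/value slots inside the FFN. The module name `KnowledgeFFN`, the `know_key`/`know_val` projections, and the GELU activation are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KnowledgeFFN(nn.Module):
    """Feed-forward block that appends retrieved knowledge as extra
    key/value slots (a hypothetical sketch of FFN knowledge injection)."""

    def __init__(self, d_model: int, d_ffn: int, d_know: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ffn)   # FFN "keys"
        self.w2 = nn.Linear(d_ffn, d_model)   # FFN "values"
        # Hypothetical projections mapping knowledge embeddings into
        # the FFN key and value spaces.
        self.know_key = nn.Linear(d_know, d_model)
        self.know_val = nn.Linear(d_know, d_model)

    def forward(self, x: torch.Tensor, knowledge: torch.Tensor) -> torch.Tensor:
        # x:         (batch, seq_len, d_model) hidden states
        # knowledge: (batch, n_know, d_know)   top-N retrieved knowledge embeddings
        k = self.know_key(knowledge)                    # (batch, n_know, d_model)
        v = self.know_val(knowledge)                    # (batch, n_know, d_model)

        # Activations over the original FFN keys plus the knowledge keys.
        h = self.w1(x)                                  # (batch, seq_len, d_ffn)
        h_know = torch.einsum("bsd,bnd->bsn", x, k)     # scores against knowledge keys
        h_all = F.gelu(torch.cat([h, h_know], dim=-1))  # (batch, seq_len, d_ffn + n_know)

        # Weighted sum over the original values and the knowledge values.
        out = self.w2(h_all[..., : h.size(-1)])
        out = out + torch.einsum("bsn,bnd->bsd", h_all[..., h.size(-1):], v)
        return out
```

In this reading, a sparse retriever such as BM25 would select the top-N knowledge sentences for a question, a text encoder would embed them into `knowledge`, and the block above would replace the FFN in a few top Transformer layers; which layers to modify and how the knowledge is encoded are left open here.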
