Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models

In this paper, we investigate the mechanisms that Transformer-based language models employ for factual recall. In zero-shot scenarios, given a prompt such as ``The capital of France is,'' task-specific attention heads extract the topic entity (e.g., ``France'') from the context and pass it to subsequent MLPs, which recall the required answer (e.g., ``Paris''). We introduce a novel analysis method that decomposes MLP outputs into human-interpretable components. Using this method, we quantify the function of the MLP layer that follows these task-specific heads: in the residual stream, it either erases or amplifies the information contributed by individual heads, and it additionally generates a component that steers the residual stream toward the direction of the expected answer. These zero-shot mechanisms are also employed in few-shot scenarios. Furthermore, we observe a pervasive anti-overconfidence mechanism in the final layer of these models that suppresses correct predictions; leveraging our interpretation, we mitigate this suppression and improve factual recall confidence. Our interpretations are evaluated across various language models, including the GPT-2 family, OPT-1.3B, and Llama 2 7B, on diverse tasks spanning multiple domains of factual knowledge.
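To make the idea of inspecting an MLP layer's contribution concrete, the sketch below shows a standard logit-lens-style projection, not the paper's exact decomposition method: it hooks one MLP's output in GPT-2 small (the layer index and model choice are illustrative assumptions) and projects that contribution onto the vocabulary to see which tokens it pushes the residual stream toward.

```python
# Illustrative sketch only: project one MLP layer's output onto the vocabulary
# via the unembedding matrix ("logit lens" style). This approximates the kind of
# human-interpretable decomposition described in the abstract; it is not the
# paper's method, and the chosen layer index is arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

mlp_out = {}
layer_idx = 9  # assumed layer, chosen for illustration

def hook(module, inp, out):
    # Capture the MLP's additive contribution to the residual stream.
    mlp_out["value"] = out.detach()

handle = model.transformer.h[layer_idx].mlp.register_forward_hook(hook)
with torch.no_grad():
    model(**inputs)
handle.remove()

# Map this MLP's contribution at the final token position into vocabulary space.
# Applying the final layer norm to the component alone is an approximation.
contrib = model.transformer.ln_f(mlp_out["value"][0, -1])
logits = contrib @ model.lm_head.weight.T  # unembedding projection
top = torch.topk(logits, 5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```

In practice, one would compare such per-component projections before and after the task-specific attention heads to see which components are erased, amplified, or steered toward the answer direction.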
