Discovering Variable Binding Circuitry with Desiderata

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for a specific subtask, solely by specifying a set of \textit{desiderata}: causal attributes that the components executing that subtask must satisfy. As a proof of concept, we apply our method to automatically discover shared \textit{variable binding circuitry} in LLaMA-13B, which retrieves variable values across multiple arithmetic tasks. Our method localizes variable binding to only 9 of the roughly 1.6k attention heads and one MLP in the final token's residual stream.

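To make the procedure concrete, the sketch below shows one way a desideratum can be turned into a differentiable objective over a binary mask of attention heads: source-run activations are patched into the base run only at masked heads, and the mask is optimized so that the patched output satisfies the desideratum while few heads are selected. This is an illustrative toy, not the paper's implementation: the frozen readout module, the random placeholder activations, and the simple sigmoid-plus-L1 sparsity penalty (a stand-in for hard-concrete L0 regularization) are all assumptions.

```python
# Toy sketch of desiderata-based component selection via differentiable
# activation patching. All shapes, data, and hyperparameters are placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

N_HEADS, D_HEAD, N_CLASSES, N_EXAMPLES = 16, 8, 10, 64

# Frozen stand-in for the rest of the network: maps concatenated per-head
# outputs at the final token position to logits over candidate answers.
readout = torch.nn.Linear(N_HEADS * D_HEAD, N_CLASSES)
for p in readout.parameters():
    p.requires_grad_(False)

# Cached per-head activations from a "base" run and a counterfactual "source"
# run, plus the answer each desideratum says the patched model should produce.
base_acts = torch.randn(N_EXAMPLES, N_HEADS, D_HEAD)
source_acts = torch.randn(N_EXAMPLES, N_HEADS, D_HEAD)
target_answers = torch.randint(0, N_CLASSES, (N_EXAMPLES,))

# One learnable logit per attention head; sigmoid(mask_logits) acts as a soft
# indicator that the head belongs to the circuit implementing the subtask.
mask_logits = torch.zeros(N_HEADS, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
sparsity_weight = 0.05

for step in range(200):
    m = torch.sigmoid(mask_logits)  # shape (N_HEADS,)
    # Interpolate: masked heads take source-run activations, others keep base.
    mixed = m[None, :, None] * source_acts + (1 - m[None, :, None]) * base_acts
    logits = readout(mixed.flatten(1))
    # Desideratum: patching the selected heads should move the output to the
    # source-run answer; the penalty keeps the selected set small.
    loss = F.cross_entropy(logits, target_answers) + sparsity_weight * m.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

circuit_heads = (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten().tolist()
print("heads selected for the subtask:", circuit_heads)
```

In practice one would cache activations from an actual model run rather than random tensors, and combine several desiderata (each defining its own base/source pairs and target behavior) into a single objective over the same mask.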