Discovering Variable Binding Circuitry with Desiderata

Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for a specific subtask, solely by specifying a set of \textit{desiderata}: causal attributes that the components executing that subtask must satisfy. As a proof of concept, we apply our method to automatically discover shared \textit{variable binding circuitry} in LLaMA-13B, which retrieves variable values across multiple arithmetic tasks. Our method localizes variable binding to only 9 of the roughly 1.6k attention heads and one MLP in the final token's residual stream.

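To make the procedure concrete, the sketch below shows one way a desideratum can be turned into a differentiable objective over a binary mask of attention heads: source-run activations are patched into the base run only at masked heads, and the mask is optimized so that the patched output satisfies the desideratum while few heads are selected. This is an illustrative toy, not the paper's implementation: the frozen readout module, the random placeholder activations, and the simple sigmoid-plus-L1 sparsity penalty (a stand-in for hard-concrete L0 regularization) are all assumptions.

```python
# Toy sketch of desiderata-based component selection via differentiable
# activation patching. All shapes, data, and hyperparameters are placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

N_HEADS, D_HEAD, N_CLASSES, N_EXAMPLES = 16, 8, 10, 64

# Frozen stand-in for the rest of the network: maps concatenated per-head
# outputs at the final token position to logits over candidate answers.
readout = torch.nn.Linear(N_HEADS * D_HEAD, N_CLASSES)
for p in readout.parameters():
    p.requires_grad_(False)

# Cached per-head activations from a "base" run and a counterfactual "source"
# run, plus the answer each desideratum says the patched model should produce.
base_acts = torch.randn(N_EXAMPLES, N_HEADS, D_HEAD)
source_acts = torch.randn(N_EXAMPLES, N_HEADS, D_HEAD)
target_answers = torch.randint(0, N_CLASSES, (N_EXAMPLES,))

# One learnable logit per attention head; sigmoid(mask_logits) acts as a soft
# indicator that the head belongs to the circuit implementing the subtask.
mask_logits = torch.zeros(N_HEADS, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
sparsity_weight = 0.05

for step in range(200):
    m = torch.sigmoid(mask_logits)  # shape (N_HEADS,)
    # Interpolate: masked heads take source-run activations, others keep base.
    mixed = m[None, :, None] * source_acts + (1 - m[None, :, None]) * base_acts
    logits = readout(mixed.flatten(1))
    # Desideratum: patching the selected heads should move the output to the
    # source-run answer; the penalty keeps the selected set small.
    loss = F.cross_entropy(logits, target_answers) + sparsity_weight * m.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

circuit_heads = (torch.sigmoid(mask_logits) > 0.5).nonzero().flatten().tolist()
print("heads selected for the subtask:", circuit_heads)
```

In practice one would cache activations from an actual model run rather than random tensors, and combine several desiderata (each defining its own base/source pairs and target behavior) into a single objective over the same mask.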