Towards Automated Circuit Discovery for Mechanistic Interpretability

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one step of this process: identifying the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery.
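To make the automated step concrete, below is a minimal, self-contained sketch of the greedy edge-pruning idea behind ACDC, run on a toy computational graph rather than a real transformer. The graph, edge weights, metric, and threshold `tau` are illustrative assumptions for this example only, not the paper's implementation; the actual algorithm patches transformer activations from a corrupted run and scores edges with metrics such as KL divergence.

```python
# Illustrative sketch of ACDC-style greedy edge pruning on a toy computational graph.
# The graph, WEIGHTS, metric, and `tau` are assumptions for this example, not the
# paper's implementation, which operates on transformer activations.

from itertools import product

# Toy "model": a DAG whose nodes sum weighted parent activations plus an input term.
EDGES = [("in_a", "mid"), ("in_b", "mid"), ("mid", "out"), ("in_b", "out")]
WEIGHTS = {("in_a", "mid"): 1.0, ("in_b", "mid"): 1.0,
           ("mid", "out"): 1.0, ("in_b", "out"): 0.01}   # one nearly irrelevant edge
NODES = ["in_a", "in_b", "mid", "out"]                   # topological order

def run(inputs, active_edges):
    """Forward pass where a knocked-out edge carries a corrupted parent value."""
    acts = {}
    corrupted = {n: 0.0 for n in NODES}   # stand-in for corrupted-run activations
    for node in NODES:
        total = inputs.get(node, 0.0)
        for (src, dst) in EDGES:
            if dst == node:
                parent = acts[src] if (src, dst) in active_edges else corrupted[src]
                total += WEIGHTS[(src, dst)] * parent
        acts[node] = total
    return acts["out"]

def metric(active_edges, dataset):
    """Mean absolute deviation from the full model's output (lower is better)."""
    full = set(EDGES)
    return sum(abs(run(x, active_edges) - run(x, full)) for x in dataset) / len(dataset)

def acdc(dataset, tau=0.1):
    """Greedily remove each edge whose removal worsens the metric by less than tau."""
    circuit = set(EDGES)
    # Sweep edges from the output backwards, as ACDC traverses the graph in reverse.
    for edge in reversed(EDGES):
        candidate = circuit - {edge}
        if metric(candidate, dataset) - metric(circuit, dataset) < tau:
            circuit = candidate   # edge is not needed for the behavior; prune it
    return circuit

if __name__ == "__main__":
    data = [{"in_a": a, "in_b": b} for a, b in product([0.0, 1.0], repeat=2)]
    print("recovered circuit:", sorted(acdc(data, tau=0.1)))
```

On this toy graph the sweep keeps the three edges that carry the behavior and prunes the near-irrelevant `("in_b", "out")` edge, mirroring how ACDC keeps only the small subgraph of edges whose ablation noticeably degrades the chosen metric.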
