Neuron to Graph: Interpreting Language Model Neurons at Scale

Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safer. Conventional methods require examining examples with strong neuron activation and manually identifying patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), a tool that automatically extracts a neuron's behaviour from the dataset the model was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the tokens most pertinent to a neuron, while enriching dataset examples with diverse samples to better cover the full spectrum of neuron behaviour. The resulting graphs can be visualised to aid researchers' manual interpretation, and can also generate token activations on arbitrary text for automatic validation against the neuron's ground-truth activations; we use this to show that N2G predicts neuron activations better than two baseline methods. We further demonstrate how the graph representations can be used to automate interpretability research, for example by searching for neurons with particular properties or by programmatically comparing neurons to identify similar ones. Our method scales easily: we build graph representations for every neuron in a 6-layer Transformer model using a single Tesla T4 GPU, enabling wide usability. We release the code and usage instructions at https://github.com/alexjfoote/Neuron2Graph.
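
To make the approach concrete, below is a minimal illustrative sketch of the core idea: representing a neuron's behaviour as a graph over the context tokens that precede activating tokens, then using that graph to predict activations on new text and compare them with ground truth. This is not the authors' implementation; the class and method names (NeuronGraph, add_example, predict) and the simple trie-over-preceding-tokens structure are assumptions made purely for illustration.

# Illustrative sketch only, not the N2G implementation. A neuron's behaviour is
# stored as a trie over the tokens preceding each activating token; prediction
# walks the same trie backwards from every position in a new text.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One token in the context graph; children map a preceding token to the next node."""
    children: Dict[str, "Node"] = field(default_factory=dict)
    activating: bool = False  # True if the recorded context ending here predicts the neuron fires


class NeuronGraph:
    """Hypothetical trie over contexts that precede tokens on which the neuron activates."""

    def __init__(self) -> None:
        self.root = Node()

    def add_example(self, tokens: List[str], position: int, context_len: int = 3) -> None:
        """Record that the neuron activated on tokens[position] given its preceding context."""
        node = self.root
        # Walk from the activating token backwards through its context window.
        for tok in reversed(tokens[max(0, position - context_len): position + 1]):
            node = node.children.setdefault(tok, Node())
        node.activating = True

    def predict(self, tokens: List[str], context_len: int = 3) -> List[bool]:
        """Predict, for each token position, whether the neuron would activate there."""
        preds = []
        for i in range(len(tokens)):
            node = self.root
            fired = False
            for tok in reversed(tokens[max(0, i - context_len): i + 1]):
                if tok not in node.children:
                    break
                node = node.children[tok]
                if node.activating:
                    fired = True
                    break
            preds.append(fired)
        return preds


if __name__ == "__main__":
    graph = NeuronGraph()
    # Suppose the neuron fired on "heads" in this pruned, high-activation example.
    example = ["induction", "heads", "copy", "tokens"]
    graph.add_example(example, position=1)
    # Compare predictions on new text with (hypothetical) ground-truth activations.
    print(graph.predict(["the", "induction", "heads", "fire"]))  # [False, False, True, False]

In the actual tool, the graph is built from truncated, saliency-filtered dataset examples enriched with diverse samples, but the same predict-and-compare loop underlies the automatic validation against ground-truth activations described above.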
