Sparse Autoencoders Find Highly Interpretable Features in Language Models

One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task \citep{wang2022interpretability} to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.
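For concreteness, below is a minimal sketch of the kind of sparse autoencoder the abstract describes: a single-hidden-layer autoencoder trained to reconstruct a model's internal activations under an L1 sparsity penalty, so that each activation is approximated as a sparse combination of learned dictionary directions. The class and function names, the tied decoder bias, and the coefficient `l1_coeff` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Illustrative sparse autoencoder over language-model activations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Encoder maps activations to an overcomplete set of feature coefficients.
        self.encoder = nn.Linear(d_model, d_hidden)
        # Decoder columns act as a learned dictionary of directions in activation space.
        self.decoder = nn.Linear(d_hidden, d_model, bias=False)
        self.decoder_bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Feature activations are non-negative and encouraged to be sparse.
        f = torch.relu(self.encoder(x - self.decoder_bias))
        # Reconstruct the activation as a sparse linear combination of dictionary directions.
        x_hat = self.decoder(f) + self.decoder_bias
        return x_hat, f


def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, f: torch.Tensor, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes feature coefficients toward zero.
    reconstruction = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return reconstruction + l1_coeff * sparsity
```

Training amounts to streaming activations from a chosen layer of the language model through such an autoencoder and minimising the loss above; the learned decoder directions are then the candidate interpretable features.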

[1] Dan Hendrycks et al. An Overview of Catastrophic AI Risks, 2023, arXiv.

[2] Stella Rose Biderman et al. LEACE: Perfect linear concept erasure in closed form, 2023, arXiv.

[3] Augustine N. Mavor-Parker et al. Towards Automated Circuit Discovery for Mechanistic Interpretability, 2023, arXiv.

[4] Oskar van der Wal et al. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, 2023, ICML.

[5] Graham Bex-Priestley. Gender as Name, 2022, Journal of Ethics and Social Philosophy.

[6] J. Steinhardt et al. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small, 2022, arXiv.

[7] Dario Amodei et al. Toy Models of Superposition, 2022, arXiv.

[8] Richard Ngo. The alignment problem from a deep learning perspective, 2022, ICLR.

[9] M. Lewis et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022, arXiv.

[10] Yoav Goldberg et al. Linear Adversarial Concept Erasure, 2022, ICML.

[11] Christopher Potts et al. Causal Abstractions of Neural Networks, 2021, NeurIPS.

[12] Isabelle Augenstein et al. Is Sparse Attention more Interpretable?, 2021, ACL.

[13] Yann LeCun et al. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors, 2021, DeeLIO.

[14] Charles Foster et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, arXiv.

[15] Nick Cammarata et al. Zoom In: An Introduction to Circuits, 2020, Distill.

[16] Zhihui Zhu et al. Analysis of the Optimization Landscapes for Overcomplete Representation Learning, 2019, arXiv.

[17] André F. T. Martins et al. Adaptively Sparse Transformers, 2019, EMNLP.

[18] Georgios Georgiadis et al. Accelerating Convolutional Neural Networks via Activation Map Compression, 2019, CVPR.

[19] Michael Carbin et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[20] Rajat Raina et al. Efficient sparse coding algorithms, 2006, NIPS.

[21] Bruno A. Olshausen et al. Sparse coding of sensory inputs, 2004, Current Opinion in Neurobiology.

[22] David J. Field et al. Sparse coding with an overcomplete basis set: A strategy employed by V1?, 1997, Vision Research.

[23] Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network, 1975, Biological Cybernetics.

[24] Hiroya Inakoshi et al. Elite BackProp: Training Sparse Interpretable Neurons, 2021, NeSy.

[25] Yonatan Belinkov et al. Investigating Gender Bias in Language Models Using Causal Mediation Analysis, 2020, NeurIPS.