Locating Cross-Task Sequence Continuation Circuits in Transformers

While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse-engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, including increasing sequences of digits, number words, and months. Applying circuit analysis techniques, we identify key sub-circuits responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Overall, documenting shared computational structures enables better prediction of model behaviors, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.
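To make the setup concrete, the sketch below shows one way such a sequence continuation task can be probed for circuit components: run a prompt through a small model, record the logit of the correct continuation, then zero-ablate a single attention head and measure the drop. This is a minimal illustration, not the paper's actual method or code; the model ("gpt2" via TransformerLens), the prompt, the target token, and the chosen head (layer 9, head 1) are assumptions for demonstration, and real circuit discovery would sweep over all heads and tasks.

```python
# Minimal, illustrative head-ablation sketch for a sequence continuation prompt.
# Model, prompt, and the ablated head are hypothetical choices, not the paper's.
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")

prompt, answer = "1 2 3 4", " 5"     # digits task; analogous prompts exist for number words and months
answer_id = model.to_single_token(answer)

def answer_logit(logits):
    # Logit assigned to the expected next sequence member at the final position.
    return logits[0, -1, answer_id].item()

clean = answer_logit(model(prompt))

layer, head = 9, 1                   # hypothetical head; circuits are located by sweeping every head

def zero_head(z, hook):
    # z has shape [batch, pos, head_index, d_head]; zero out one head's output.
    z[:, :, head, :] = 0.0
    return z

ablated = answer_logit(
    model.run_with_hooks(prompt, fwd_hooks=[(get_act_name("z", layer), zero_head)])
)
print(f"clean logit {clean:.2f} -> ablated {ablated:.2f} (drop {clean - ablated:.2f})")
```

A large logit drop under ablation flags the head as a candidate member of the continuation circuit; comparing which heads matter across the digit, number-word, and month tasks is what reveals shared sub-circuits.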
