Progress measures for grokking via mechanistic interpretability

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding emergence is to find continuous \textit{progress measures} that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently discovered phenomenon of ``grokking'' exhibited by small transformers trained on modular addition tasks. We fully reverse-engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
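To make the claimed mechanism concrete, here is a minimal worked sketch of the trigonometric-identity step the abstract describes, written for a single assumed key frequency $w = 2\pi k / p$ (the trained networks use several such frequencies; the symbols $k$, $p$, $a$, $b$, $c$ are illustrative notation, not necessarily the paper's). Embedding the inputs $a$ and $b$ as points on the unit circle at frequency $w$ lets the network combine them by the angle-addition identities
\[
\cos\bigl(w(a+b)\bigr) = \cos(wa)\cos(wb) - \sin(wa)\sin(wb),
\qquad
\sin\bigl(w(a+b)\bigr) = \sin(wa)\cos(wb) + \cos(wa)\sin(wb),
\]
and the logit for a candidate answer $c$ can then be read off (up to scaling) as
\[
\cos\bigl(w(a+b-c)\bigr) = \cos\bigl(w(a+b)\bigr)\cos(wc) + \sin\bigl(w(a+b)\bigr)\sin(wc),
\]
which, for $p$ prime and $\gcd(k, p) = 1$, is maximized exactly when $c \equiv a+b \pmod{p}$; summing such terms over the key frequencies makes the correct residue stand out against all other candidates.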
