Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer

The Transformer architecture has shown impressive performance in multiple research domains and has become the backbone of many neural network models. However, there is limited understanding of how it works. In particular, with a simple predictive loss, how the representation emerges from the gradient \emph{training dynamics} remains a mystery. In this paper, for a 1-layer transformer with one self-attention layer plus one decoder layer, we analyze its SGD training dynamics for the task of next-token prediction in a mathematically rigorous manner. We open the black box of the dynamic process of how the self-attention layer combines input tokens, and reveal the nature of the underlying inductive bias. More specifically, under the assumptions of (a) no positional encoding, (b) long input sequences, and (c) a decoder layer that learns faster than the self-attention layer, we prove that self-attention acts as a \emph{discriminative scanning algorithm}: starting from uniform attention, it gradually attends more to distinct key tokens for a specific next token to be predicted, and pays less attention to common key tokens that occur across different next tokens. Among distinct tokens, it progressively drops attention weights, following the order of low to high co-occurrence between the key and the query token in the training set. Interestingly, this procedure does not lead to winner-takes-all, but decelerates due to a \emph{phase transition} that is controllable by the learning rates of the two layers, leaving an (almost) fixed token combination. We verify this \textbf{\emph{scan and snap}} dynamics on synthetic and real-world data (WikiText).
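To make the analyzed setup concrete, the following is a minimal sketch (not the paper's code) of such a 1-layer model in PyTorch: one self-attention layer combining input tokens without positional encoding, followed by a decoder layer, trained by SGD on next-token prediction with separate learning rates for the two layers. The vocabulary size, dimensions, and learning rates are illustrative placeholders, not values from the paper.

```python
# Minimal sketch, assuming a PyTorch implementation of the 1-layer setup:
# one self-attention layer plus one decoder layer, no positional encoding,
# trained with SGD on next-token prediction. All hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneLayerTransformer(nn.Module):
    def __init__(self, vocab_size=64, d_model=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # token embeddings only (no positional encoding)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.decoder = nn.Linear(d_model, vocab_size, bias=False)  # decoder layer

    def forward(self, tokens):
        x = self.embed(tokens)                # (batch, seq, d_model)
        q = x[:, -1:, :]                      # the last (query) token predicts the next token
        ctx, attn_weights = self.attn(q, x, x)  # self-attention combines the input (key) tokens
        logits = self.decoder(ctx.squeeze(1)) # (batch, vocab)
        return logits, attn_weights

model = OneLayerTransformer()

# Separate learning rates for the two layers; the phase transition described in the
# abstract is controlled by the ratio of these rates (values here are illustrative).
opt = torch.optim.SGD([
    {"params": model.attn.parameters(), "lr": 1e-3},
    {"params": list(model.embed.parameters()) + list(model.decoder.parameters()), "lr": 1e-2},
])

# Toy batch: long input sequences and their next-token targets.
tokens = torch.randint(0, 64, (8, 16))
next_tokens = torch.randint(0, 64, (8,))

logits, attn = model(tokens)
loss = F.cross_entropy(logits, next_tokens)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item(), attn.shape)  # attention over key tokens: (8, 1, 16)
```

Tracking `attn_weights` over training steps is how one would observe the scanning behavior: attention starting uniform, concentrating on distinct key tokens, and then (nearly) freezing after the phase transition.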
