Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

In-context learning (ICL) emerges as a promising capability of large language models (LLMs), which can perform diverse tasks when provided with demonstration examples. However, the underlying mechanism of how LLMs learn from the provided context remains under-explored. In this paper, we investigate the working mechanism of ICL through an information flow lens. Our findings reveal that label words in the demonstration examples function as anchors: (1) semantic information aggregates into label word representations during processing in the shallow layers; (2) the information consolidated in the label words serves as a reference for the LLM's final prediction in the deeper layers. Based on these insights, we introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL. The promising applications of our findings further validate the uncovered ICL working mechanism and pave the way for future studies.
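To make the "label words as anchors" idea concrete, the sketch below probes a small ICL prompt with GPT2-XL and reports, per layer, how much of the final position's attention lands on the label-word tokens. This is only a rough, illustrative proxy based on raw attention weights, not the paper's saliency-based information flow measure; the prompt text, the label words "Positive"/"Negative", and the single-sub-token assumption for the labels are all hypothetical choices for this example.

```python
# Minimal sketch: measure the share of attention the prediction position pays
# to label-word tokens at each layer of GPT2-XL. Raw attention is used as a
# rough proxy for information flow; the real analysis in the paper is more
# involved (e.g., saliency-based scores), so treat this only as an illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"  # swap in "gpt2" for a quick, low-memory test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

# A toy 2-shot sentiment prompt; "Positive" / "Negative" are the label words.
prompt = (
    "Review: A touching and beautiful film. Sentiment: Positive\n"
    "Review: A dull, lifeless mess. Sentiment: Negative\n"
    "Review: An unforgettable masterpiece. Sentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"][0]

# Positions of the label-word tokens (first sub-token if a label splits).
label_ids = {tokenizer.encode(" Positive")[0], tokenizer.encode(" Negative")[0]}
label_pos = [i for i, t in enumerate(input_ids.tolist()) if t in label_ids]
target_pos = input_ids.shape[0] - 1  # the position that predicts the answer

with torch.no_grad():
    out = model(**inputs)

# For each layer, average attention over heads and compare attention from the
# target position to the label words vs. to all preceding context positions.
for layer, attn in enumerate(out.attentions):
    a = attn[0].mean(dim=0)                      # (seq, seq), head-averaged
    to_labels = a[target_pos, label_pos].sum().item()
    to_context = a[target_pos, :target_pos].sum().item()
    print(f"layer {layer:2d}: label-word attention share = {to_labels / to_context:.3f}")
```

If the anchoring effect holds, the label-word share tends to grow in the deeper layers, consistent with the claim that the prediction position reads out from the label-word representations.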
