Data Distributional Properties Drive Emergent In-Context Learning in Transformers

Large transformer-based language models can perform few-shot learning (also known as in-context learning) without having been explicitly trained for it. We hypothesized that specific distributional properties of natural language might drive this emergent phenomenon, because these properties might create a kind of interpolation between few-shot meta-training (designed to elicit rapid few-shot learning) and standard supervised training (designed to elicit gradual in-weights learning). We further hypothesized that these distributional properties could produce emergent few-shot learning in domains beyond language. To test these ideas, we ran a series of experiments on a standard image-based few-shot dataset. We found that several data properties did promote the emergence of few-shot learning in transformer models, all of which are present in natural language: burstiness, long-tailedness, and many-to-one or one-to-many label mappings. The data determined whether models were biased towards few-shot learning or towards memorizing information in their weights; models could generally perform well at only one or the other. However, an additional distributional property, which also occurs in language, allowed the two capabilities to co-exist in the same model: a skewed, Zipfian distribution over classes. Notably, training data that elicited few-shot learning in transformers failed to elicit it in recurrent models. In sum, few-shot learning emerges only from applying the right architecture to the right data distribution; neither component is sufficient on its own. When training on skewed distributions, there is a sweet spot at which both few-shot learning and in-weights memorization are maintained at a high level in the same model (a Zipf exponent of 1, for this particular training regime). Intriguingly, a Zipf exponent of 1 corresponds approximately to the skew of many natural languages. Rare classes from training are never memorized; in-weights performance on them remains at chance for all Zipf exponents.
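
To make the training distribution concrete, the sketch below samples a single "bursty" context sequence whose classes follow a Zipf distribution with a configurable exponent, in the spirit of the image-based few-shot setup described above. This is a minimal illustration under stated assumptions, not the authors' actual data pipeline: the function names, context length, and burst size are hypothetical choices, and only the use of a Zipf class distribution and bursty contexts reflects the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_class_probs(n_classes: int, exponent: float = 1.0) -> np.ndarray:
    """Class probabilities proportional to 1 / rank**exponent (Zipf's law)."""
    ranks = np.arange(1, n_classes + 1)
    weights = 1.0 / ranks**exponent
    return weights / weights.sum()

def sample_bursty_sequence(n_classes: int,
                           context_len: int = 8,
                           burst_size: int = 3,
                           exponent: float = 1.0):
    """Sample one training sequence: a context in which one Zipf-sampled class
    repeats `burst_size` times ('burstiness'), padded with distractor classes,
    followed by a query from the bursty class. Parameter values are illustrative."""
    probs = zipf_class_probs(n_classes, exponent)
    bursty_class = rng.choice(n_classes, p=probs)
    distractors = rng.choice(n_classes, size=context_len - burst_size, p=probs)
    context = np.concatenate([np.full(burst_size, bursty_class), distractors])
    rng.shuffle(context)
    query = bursty_class
    return context, query

# Example: Omniglot-scale label space (1623 character classes).
context, query = sample_bursty_sequence(n_classes=1623, exponent=1.0)
print(context, query)
```

Varying `exponent` traces out the trade-off described above: an exponent of 0 yields a uniform class distribution, larger exponents concentrate training on a few common classes, and a value near 1 is where the abstract reports few-shot learning and in-weights memorization co-existing at a high level.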
