Data Distributional Properties Drive Emergent In-Context Learning in Transformers
Stephanie C. Y. Chan, Adam Santoro, Andrew Kyle Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, James L. McClelland, Felix Hill
[1] Andrew Kyle Lampinen, et al. Zipfian environments for Reinforcement Learning, 2022, CoLLAs.
[2] M. Lewis, et al. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, 2022, EMNLP.
[3] Matt Gardner, et al. Impact of Pretraining Term Frequencies on Few-Shot Reasoning, 2022, arXiv.
[4] S. Gu, et al. Can Wikipedia Help Offline Reinforcement Learning?, 2022, arXiv.
[5] Sang Michael Xie, et al. An Explanation of In-context Learning as Implicit Bayesian Inference, 2021, ICLR.
[6] Ellie Pavlick, et al. Do Prompt-Based Models Really Understand the Meaning of Their Prompts?, 2021, NAACL.
[7] Stephen Clark, et al. Grounded Language Learning Fast and Slow, 2020, ICLR.
[8] J. Hopfield, et al. Large Associative Memory Problem in Neurobiology and Machine Learning, 2020, ICLR.
[9] David P. Kreil, et al. Hopfield Networks is All You Need, 2020, ICLR.
[10] Hinrich Schütze, et al. Placing language in an integrated understanding system: Next steps toward human-level performance in neural language models, 2020, Proceedings of the National Academy of Sciences.
[11] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[12] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, Journal of Machine Learning Research.
[13] Joshua B. Tenenbaum, et al. The Omniglot challenge: a 3-year progress report, 2019, Current Opinion in Behavioral Sciences.
[14] Linda B. Smith, et al. The Developing Infant Creates a Curriculum for Statistical Learning, 2018, Trends in Cognitive Sciences.
[15] François Fleuret, et al. Not All Samples Are Created Equal: Deep Learning with Importance Sampling, 2018, ICML.
[16] J. Saffran, et al. Infant Statistical Learning, 2018, Annual Review of Psychology.
[17] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[18] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[19] Sergey Levine, et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, 2017, ICML.
[20] Zeb Kurth-Nelson, et al. Learning to reinforcement learn, 2016, CogSci.
[21] James L. McClelland, et al. What Learning Systems do Intelligent Agents Need? Complementary Learning Systems Theory Updated, 2016, Trends in Cognitive Sciences.
[22] Daan Wierstra, et al. Meta-Learning with Memory-Augmented Neural Networks, 2016, ICML.
[23] Oriol Vinyals, et al. Matching Networks for One Shot Learning, 2016, NIPS.
[24] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.
[25] S. Piantadosi. Zipf's word frequency law in natural language: A critical review and future directions, 2014, Psychonomic Bulletin & Review.
[26] Jean-Charles Delvenne, et al. Burstiness and spreading on temporal networks, 2013, arXiv.
[27] Filippo Menczer, et al. Modeling Statistical Properties of Written Text, 2009, PLoS ONE.
[28] Adilson E. Motter, et al. Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words, 2009, PLoS ONE.
[29] Taghi M. Khoshgoftaar, et al. Experimental perspectives on learning from imbalanced data, 2007, ICML.
[30] J.-P. Eckmann, et al. Hierarchical structures induce long-range dynamical correlations in written texts, 2006, Proceedings of the National Academy of Sciences.
[31] Paul H. Garthwaite, et al. A Bayesian Mixture Model for Term Re-occurrence and Burstiness, 2005, CoNLL.
[32] Nitesh V. Chawla, et al. SMOTE: Synthetic Minority Over-sampling Technique, 2002, Journal of Artificial Intelligence Research.
[33] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[34] James L. McClelland, et al. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory, 1995, Psychological Review.
[35] M. Neuts. The burstiness of point processes, 1993.
[36] L. Squire. Memory and the hippocampus: a synthesis from findings with rats, monkeys, and humans, 1992, Psychological Review.
[37] David Yarowsky. One Sense Per Discourse, 1992, HLT.
[38] Geoffrey E. Hinton, et al. Learning internal representations by error propagation, 1986.
[39] George Kingsley Zipf. Human behavior and the principle of least effort, 1949.