Emergent Abilities of Large Language Models

Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
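To make the extrapolation argument concrete, here is a minimal illustrative sketch (not taken from the paper): the model sizes, accuracy values, and the log-linear fit are all hypothetical, chosen only to show how a trend fitted on smaller models can stay near random chance while the largest model's observed performance jumps sharply.

```python
# Illustrative sketch only: why extrapolating small-model performance
# can fail to predict an emergent ability. All numbers are hypothetical.
import numpy as np

# Hypothetical few-shot accuracy (%) measured at increasing model scales.
params = np.array([1e8, 1e9, 1e10, 1e11])    # model size in parameters
accuracy = np.array([0.5, 1.0, 2.0, 55.0])   # near-random until the largest scale

# Extrapolate from the three smaller models with a linear fit in log-parameter space.
log_small = np.log10(params[:3])
fit = np.polyfit(log_small, accuracy[:3], deg=1)
predicted_large = np.polyval(fit, np.log10(params[3]))

print(f"Extrapolated accuracy at 1e11 params: {predicted_large:.1f}%")
print(f"Observed accuracy at 1e11 params:     {accuracy[3]:.1f}%")
# The extrapolation stays close to random chance (~2.7%), while the observed
# accuracy jumps to 55% -- the qualitative signature of an emergent ability.
```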
