Language Models are Few-Shot Learners

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
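
The few-shot setting described above is pure in-context learning: the task description and a handful of demonstrations are concatenated into a single text prompt, and the model produces its answer by continuing that text, with no gradient updates or fine-tuning. The sketch below illustrates the idea, assuming a Hugging Face causal language model (GPT-2) as a stand-in, since GPT-3 itself is not openly available; the prompt template and decoding settings are illustrative choices, not the paper's evaluation protocol.

```python
# Minimal sketch of few-shot ("in-context") prompting with an autoregressive LM.
# Assumption: GPT-2 via Hugging Face transformers stands in for GPT-3, which is not openly available.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A task description plus K demonstrations, followed by the query to complete.
# All "learning" happens through the text in the context window; the weights are never updated.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=5,                     # only a short completion is needed
    do_sample=False,                      # greedy decoding for a deterministic answer
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid a warning
)

# Strip the prompt tokens and keep only the model's continuation.
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion.strip())
```

Dropping the demonstrations from the prompt gives the zero-shot setting and keeping exactly one gives the one-shot setting; the paper evaluates all three regimes under the same no-gradient-update constraint.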
