Language Models are Few-Shot Learners

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
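
The few-shot setting described above is pure in-context learning: the task description and a handful of demonstrations are concatenated into a single text prompt, and the model produces its answer by continuing that text, with no gradient updates or fine-tuning. The sketch below illustrates the idea, assuming a Hugging Face causal language model (GPT-2) as a stand-in, since GPT-3 itself is not openly available; the prompt template and decoding settings are illustrative choices, not the paper's evaluation protocol.

```python
# Minimal sketch of few-shot ("in-context") prompting with an autoregressive LM.
# Assumption: GPT-2 via Hugging Face transformers stands in for GPT-3, which is not openly available.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A task description plus K demonstrations, followed by the query to complete.
# All "learning" happens through the text in the context window; the weights are never updated.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=5,                     # only a short completion is needed
    do_sample=False,                      # greedy decoding for a deterministic answer
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to avoid a warning
)

# Strip the prompt tokens and keep only the model's continuation.
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion.strip())
```

Dropping the demonstrations from the prompt gives the zero-shot setting and keeping exactly one gives the one-shot setting; the paper evaluates all three regimes under the same no-gradient-update constraint.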
