Scaling Instruction-Finetuned Language Models
Andrew M. Dai | Hyung Won Chung | J. Dean | Yanping Huang | Xinyun Chen | S. Gu | Jacob Devlin | Xuezhi Wang | Slav Petrov | W. Fedus | Sharan Narang | Yi Tay | Barret Zoph | Ed Chi | Adam Roberts | Denny Zhou | Zhuyun Dai | Hongkun Yu | Mostafa Dehghani | Aakanksha Chowdhery | Jason Wei | S. Longpre | Siddhartha Brahma | Mirac Suzgun | Albert Webson | Vincent Zhao | Gaurav Mishra | Le Hou | Eric Li | A. Yu | Quoc V. Le | Dasha Valter
[1] Quoc V. Le, et al. Transcending Scaling Laws with 0.1% Extra Compute, 2022, arXiv:2210.11399.
[2] S. Gu, et al. Large Language Models Can Self-Improve, 2022, arXiv:2210.11610.
[3] Quoc V. Le, et al. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, 2022, ACL.
[4] Hyung Won Chung, et al. Language Models are Multilingual Chain-of-Thought Reasoners, 2022, ICLR.
[5] Yuhuai Wu, et al. Solving Quantitative Reasoning Problems with Language Models, 2022, NeurIPS.
[6] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, arXiv.
[7] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, arXiv.
[8] S. Gu, et al. Large Language Models are Zero-Shot Reasoners, 2022, arXiv.
[9] Arun Tejasvi Chaganty, et al. Dialog Inpainting: Turning Documents into Dialogs, 2022, ICML.
[10] G. Karypis, et al. Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning, 2022, NAACL.
[11] Kuntal Kumar Pal, et al. Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks, 2022, arXiv.
[12] Hyung Won Chung, et al. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?, 2022, ICML.
[13] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.
[14] Andrew Zaldivar, et al. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI, 2022, FAccT.
[15] Marc van Zee, et al. Scaling Up Models and Data with t5x and seqio, 2022, J. Mach. Learn. Res.
[16] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, arXiv.
[17] Noah D. Goodman, et al. STaR: Bootstrapping Reasoning With Reasoning, 2022, arXiv:2203.14465.
[18] D. Schuurmans, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022, arXiv.
[19] Swaroop Mishra, et al. How Many Data Samples is an Additional Instruction Worth?, 2022, Findings.
[20] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.
[21] Cherepanov, et al. Competition-level code generation with AlphaCode, 2022, Science.
[22] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, arXiv.
[23] Quoc V. Le, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2021, ICML.
[24] M. Lewis, et al. MetaICL: Learning to Learn In Context, 2021, NAACL.
[25] Alexander M. Rush, et al. Multitask Prompted Training Enables Zero-Shot Task Generalization, 2021, ICLR.
[26] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.
[27] Rami Al-Rfou, et al. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models, 2021, Transactions of the Association for Computational Linguistics.
[28] Vinh Q. Tran, et al. Unifying Language Learning Paradigms, 2022, arXiv.
[29] S. Muresan, et al. Continual-T0: Progressively Instructing 50+ Tasks to Language Models Without Forgetting, 2022, arXiv.
[30] Po-Sen Huang, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021, arXiv.
[31] David Bieber, et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models, 2021, arXiv.
[32] Mohammad Bavarian, et al. Training Verifiers to Solve Math Word Problems, 2021, arXiv.
[33] Eunsol Choi, et al. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge, 2021, NeurIPS Datasets and Benchmarks.
[34] Michael S. Bernstein, et al. On the Opportunities and Risks of Foundation Models, 2021, arXiv.
[35] Wojciech Zaremba, et al. Evaluating Large Language Models Trained on Code, 2021, arXiv.
[36] Xiang Ren, et al. CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP, 2021, EMNLP.
[37] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.
[38] Dan Klein, et al. Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections, 2021, EMNLP.
[39] Sonal Gupta, et al. Muppet: Massive Multi-task Representations with Pre-Finetuning, 2021, EMNLP.
[40] Jonathan Berant, et al. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies, 2021, Transactions of the Association for Computational Linguistics.
[41] Zhucheng Tu, et al. Open-Domain Question Answering Goes Conversational via Question Rewriting, 2020, NAACL.
[42] Eunsol Choi, et al. QED: A Framework and Dataset for Explanations in Question Answering, 2020, Transactions of the Association for Computational Linguistics.
[43] Dawn Song, et al. Measuring Massive Multitask Language Understanding, 2020, ICLR.
[44] Quoc V. Le, et al. Searching for Efficient Transformers for Language Modeling, 2021, NeurIPS.
[45] Dinesh Garg, et al. Explanations for CommonsenseQA: New Dataset and Models, 2021, ACL.
[46] David Patterson, et al. A domain-specific supercomputer for training deep neural networks, 2020, Commun. ACM.
[47] Jonathan Berant, et al. Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge, 2020, arXiv.
[48] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[49] Percy Liang, et al. Graph-based, Self-Supervised Program Repair from Diagnostic Feedback, 2020, ICML.
[50] Hannaneh Hajishirzi, et al. UnifiedQA: Crossing Format Boundaries With a Single QA System, 2020, Findings.
[51] Eunsol Choi, et al. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, Transactions of the Association for Computational Linguistics.
[52] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, arXiv.
[53] Ashish Sabharwal, et al. QASC: A Dataset for Question Answering via Sentence Composition, 2019, AAAI.
[54] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[55] Thomas Lukasiewicz, et al. Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations, 2019, ACL.
[56] Bill Byrne, et al. Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset, 2019, EMNLP.
[57] Richard Socher, et al. Explain Yourself! Leveraging Language Models for Commonsense Reasoning, 2019, ACL.
[58] Yue Zhang, et al. Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation, 2019, ACL.
[59] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.
[60] Inioluwa Deborah Raji, et al. Model Cards for Model Reporting, 2018, FAT.
[61] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[62] Thomas Lukasiewicz, et al. e-SNLI: Natural Language Inference with Natural Language Explanations, 2018, NeurIPS.
[63] Rachel Rudinger, et al. Gender Bias in Coreference Resolution, 2018, NAACL.
[64] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.
[65] Wang Ling, et al. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems, 2017, ACL.