Scaling Instruction-Finetuned Language Models

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought (CoT) data. We find that instruction finetuning along these three dimensions dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

[...] was used in the U-PaLM model. This result shows that instruction finetuning and UL2 continued pre-training are complementary, compute-efficient methods to improve the performance of language models without increasing model scale.

[...] model on several self-generated synthetic CoT datasets. Compared to that work, we jointly finetune on CoT and non-CoT data and show that a single checkpoint can be used for both settings.
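To make the idea of phrasing one dataset example as an instruction, with or without a chain-of-thought target, more concrete, here is a minimal sketch. The template wording, the function name `format_example`, and the field names are hypothetical illustrations and are not taken from the paper; the actual templates used for Flan may differ substantially.

```python
from typing import Dict, Optional


def format_example(
    question: str,
    answer: str,
    rationale: Optional[str] = None,
    instruction: str = "Answer the following question.",
) -> Dict[str, str]:
    """Render one example as an (input, target) text pair.

    Hypothetical templates for illustration only.
    If `rationale` is given, the target contains the reasoning followed by
    the answer (CoT style); otherwise the target is the answer alone.
    """
    if rationale is None:
        # Non-CoT variant: the model is trained to emit the answer directly.
        return {
            "input": f"{instruction}\n\nQ: {question}\nA:",
            "target": answer,
        }
    # CoT variant: the instruction asks for reasoning, and the target
    # contains the rationale before the final answer.
    return {
        "input": f"{instruction} Give your reasoning step by step.\n\nQ: {question}\nA:",
        "target": f"{rationale} So the answer is {answer}.",
    }


# A joint mixture simply contains both renderings side by side.
mixture = [
    format_example("What is 2 + 2?", "4"),
    format_example("What is 2 + 2?", "4", rationale="2 plus 2 equals 4."),
]
```

Training a single model on such a mixture is one way to picture how one checkpoint could serve both direct-answer and step-by-step prompting; the real mixture proportions and templates are not specified in this section.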
