Scaling Instruction-Finetuned Language Models

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought (CoT) data. We find that instruction finetuning with the above aspects dramatically improves performance across a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
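Because the Flan-T5 checkpoints are publicly released, the finetuned models can be queried directly. The sketch below is illustrative rather than from the paper: it assumes the Hugging Face transformers library and the google/flan-t5-xl checkpoint name, and shows the abstract's point that a single instruction-finetuned checkpoint handles both direct zero-shot prompting and chain-of-thought-style prompting.

```python
# Minimal sketch (not from the paper): querying a released Flan-T5 checkpoint.
# Assumes the Hugging Face transformers library; "google/flan-t5-xl" is one of
# the released checkpoint names, chosen here for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    # Encode the instruction, generate a completion, and decode it back to text.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Zero-shot instruction following.
print(generate("Translate to German: The house is wonderful."))

# Chain-of-thought-style prompting: the same checkpoint is used, no separate
# CoT-specialized model is needed.
print(generate(
    "Q: A juggler has 16 balls. Half are golf balls, and half of the golf "
    "balls are blue. How many blue golf balls are there? "
    "Let's think step by step."
))
```

Greedy decoding (the generate default) is used here for reproducibility; sampling-based strategies such as self-consistency [25] are reported to further improve CoT accuracy.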

[1] Quoc V. Le, et al. Transcending Scaling Laws with 0.1% Extra Compute, 2022, EMNLP.

[2] S. Gu, et al. Large Language Models Can Self-Improve, 2022, EMNLP.

[3] Kai-Wei Chang, et al. The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks, 2022, ArXiv.

[4] Quoc V. Le, et al. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, 2022, ACL.

[5] Negar Rostamzadeh, et al. Sociotechnical Harms: Scoping a Taxonomy for Harm Reduction, 2022, ArXiv.

[6] Hyung Won Chung, et al. Language Models are Multilingual Chain-of-Thought Reasoners, 2022, ICLR.

[7] Partha P. Talukdar, et al. Re-contextualizing Fairness in NLP: The Case of India, 2022, AACL.

[8] S. Brown, et al. Algorithmic Bias and Risk Assessments: Lessons from Practice, 2022, Digital Society.

[9] Yuhuai Wu, et al. Solving Quantitative Reasoning Problems with Language Models, 2022, NeurIPS.

[10] Lisa Anne Hendricks, et al. Taxonomy of Risks posed by Language Models, 2022, FAccT.

[11] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, Trans. Mach. Learn. Res.

[12] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.

[13] S. Gu, et al. Large Language Models are Zero-Shot Reasoners, 2022, ArXiv.

[14] Eric Michael Smith, et al. “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset, 2022, EMNLP.

[15] Arun Tejasvi Chaganty, et al. Dialog Inpainting: Turning Documents into Dialogs, 2022, ICML.

[16] Z. Chen, et al. Building Machine Translation Systems for the Next Thousand Languages, 2022, ArXiv.

[17] I. Kivlichan, et al. Is Your Toxicity My Toxicity? Exploring the Impact of Rater Identity on Toxicity Annotation, 2022, Proc. ACM Hum. Comput. Interact.

[18] G. Karypis, et al. Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning, 2022, NAACL.

[19] Hyung Won Chung, et al. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?, 2022, ICML.

[20] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.

[21] Andrew Zaldivar, et al. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI, 2022, FAccT.

[22] Marc van Zee, et al. Scaling Up Models and Data with t5x and seqio, 2022, ArXiv.

[23] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, ArXiv.

[24] Noah D. Goodman, et al. STaR: Bootstrapping Reasoning With Reasoning, 2022, ArXiv.

[25] D. Schuurmans, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022, ArXiv.

[26] Swaroop Mishra, et al. How Many Data Samples is an Additional Instruction Worth?, 2022, FINDINGS.

[27] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.

[28] Cherepanov, et al. Competition-level code generation with AlphaCode, 2022, Science.

[29] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, ArXiv.

[30] Tanmoy Chakraborty, et al. Handling Bias in Toxic Speech Detection: A Survey, 2022, ArXiv.

[31] Quoc V. Le, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2021, ICML.

[32] Noah A. Smith, et al. Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection, 2021, NAACL.

[33] M. Lewis, et al. MetaICL: Learning to Learn In Context, 2021, NAACL.

[34] Alexander M. Rush, et al. Multitask Prompted Training Enables Zero-Shot Task Generalization, 2021, ICLR.

[35] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.

[36] Kai-Wei Chang, et al. On Measures of Biases and Harms in NLP, 2021, AACL/IJCNLP.

[37] Rami Al-Rfou, et al. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models, 2021, Transactions of the Association for Computational Linguistics.

[38] Hannaneh Hajishirzi, et al. Cross-Task Generalization via Natural Language Crowdsourcing Instructions, 2021, ACL.

[39] Noah A. Smith, et al. Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks, 2022, ArXiv.

[40] Vinh Q. Tran, et al. Unifying Language Learning Paradigms, 2022, ArXiv.

[41] Rachel Rudinger, et al. Recognition of They/Them as Singular Personal Pronouns in Coreference Resolution, 2022, NAACL.

[42] S. Muresan, et al. Continual-T0: Progressively Instructing 50+ Tasks to Language Models Without Forgetting, 2022, ArXiv.

[43] Po-Sen Huang, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021, ArXiv.

[44] David Bieber, et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models, 2021, ArXiv.

[45] Sebastian Gehrmann, et al. SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets, 2021, ArXiv.

[46] Mohammad Bavarian, et al. Training Verifiers to Solve Math Word Problems, 2021, ArXiv.

[47] Eunsol Choi, et al. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge, 2021, NeurIPS Datasets and Benchmarks.

[48] Anaelia Ovalle, et al. Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies, 2021, EMNLP.

[49] Michael S. Bernstein, et al. On the Opportunities and Risks of Foundation Models, 2021, ArXiv.

[50] Wojciech Zaremba, et al. Evaluating Large Language Models Trained on Code, 2021, ArXiv.

[51] Xiang Ren, et al. CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP, 2021, EMNLP.

[52] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.

[53] Dan Klein, et al. Detoxifying Language Models Risks Marginalizing Minority Voices, 2021, NAACL.

[54] Dan Klein, et al. Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections, 2021, EMNLP.

[55] Emily M. Bender, et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.

[56] Timo Schick, et al. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP, 2021, Transactions of the Association for Computational Linguistics.

[57] Sonal Gupta, et al. Muppet: Massive Multi-task Representations with Pre-Finetuning, 2021, EMNLP.

[58] Jonathan Berant, et al. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies, 2021, Transactions of the Association for Computational Linguistics.

[59] Zhucheng Tu, et al. Open-Domain Question Answering Goes Conversational via Question Rewriting, 2020, NAACL.

[60] Eunsol Choi, et al. QED: A Framework and Dataset for Explanations in Question Answering, 2020, Transactions of the Association for Computational Linguistics.

[61] Dawn Song, et al. Measuring Massive Multitask Language Understanding, 2020, ICLR.

[62] Quoc V. Le, et al. Searching for Efficient Transformers for Language Modeling, 2021, NeurIPS.

[63] Dinesh Garg, et al. Explanations for CommonsenseQA: New Dataset and Models, 2021, ACL.

[64] Dana Dannélls, et al. The Swedish Winogender Dataset, 2021, NODALIDA.

[65] Yejin Choi, et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020, FINDINGS.

[66] David Patterson, et al. A domain-specific supercomputer for training deep neural networks, 2020, Commun. ACM.

[67] Jonathan Berant, et al. Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge, 2020, ArXiv.

[68] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[69] Percy Liang, et al. Graph-based, Self-Supervised Program Repair from Diagnostic Feedback, 2020, ICML.

[70] Hannaneh Hajishirzi, et al. UnifiedQA: Crossing Format Boundaries With a Single QA System, 2020, FINDINGS.

[71] Eunsol Choi, et al. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, Transactions of the Association for Computational Linguistics.

[72] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.

[73] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[74] Ashish Sabharwal, et al. QASC: A Dataset for Question Answering via Sentence Composition, 2019, AAAI.

[75] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[76] Thomas Lukasiewicz, et al. Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations, 2019, ACL.

[77] Bill Byrne, et al. Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset, 2019, EMNLP.

[78] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[79] Richard Socher, et al. Explain Yourself! Leveraging Language Models for Commonsense Reasoning, 2019, ACL.

[80] Yue Zhang, et al. Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation, 2019, ACL.

[81] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[82] Lucy Vasserman, et al. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification, 2019, WWW.

[83] Inioluwa Deborah Raji, et al. Model Cards for Model Reporting, 2018, FAT.

[84] Chelsea Lee. Welcome, Singular "They", 2019, PsycEXTRA Dataset.

[85] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[86] Thomas Lukasiewicz, et al. e-SNLI: Natural Language Inference with Natural Language Explanations, 2018, NeurIPS.

[87] Rachel Rudinger, et al. Gender Bias in Coreference Resolution, 2018, NAACL.

[88] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.

[89] Wang Ling, et al. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems, 2017, ACL.

[90] Philipp Koehn, et al. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.

[91] Philip A.E. Brey. Anticipatory Ethics for Emerging Technologies, 2012.