Scaling Instruction-Finetuned Language Models

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought (CoT) data. We find that instruction finetuning with the above aspects dramatically improves performance across a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B, instruction-finetuned on 1.8K tasks, outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
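Because the Flan-T5 checkpoints are publicly released, the finetuned models can be queried directly. The sketch below is illustrative rather than from the paper: it assumes the Hugging Face transformers library and the google/flan-t5-xl checkpoint name, and shows the abstract's point that a single instruction-finetuned checkpoint handles both direct zero-shot prompting and chain-of-thought-style prompting.

```python
# Minimal sketch (not from the paper): querying a released Flan-T5 checkpoint.
# Assumes the Hugging Face transformers library; "google/flan-t5-xl" is one of
# the released checkpoint names, chosen here for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    # Encode the instruction, generate a completion, and decode it back to text.
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Zero-shot instruction following.
print(generate("Translate to German: The house is wonderful."))

# Chain-of-thought-style prompting: the same checkpoint is used, no separate
# CoT-specialized model is needed.
print(generate(
    "Q: A juggler has 16 balls. Half are golf balls, and half of the golf "
    "balls are blue. How many blue golf balls are there? "
    "Let's think step by step."
))
```

Greedy decoding (the generate default) is used here for reproducibility; sampling-based strategies such as self-consistency [25] are reported to further improve CoT accuracy.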

[1] Quoc V. Le, et al. Transcending Scaling Laws with 0.1% Extra Compute, 2022, EMNLP.

[2] S. Gu, et al. Large Language Models Can Self-Improve, 2022, EMNLP.

[3] Kai-Wei Chang, et al. The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks, 2022, ArXiv.

[4] Quoc V. Le, et al. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them, 2022, ACL.

[5] Negar Rostamzadeh, et al. Sociotechnical Harms: Scoping a Taxonomy for Harm Reduction, 2022, ArXiv.

[6] Hyung Won Chung, et al. Language Models are Multilingual Chain-of-Thought Reasoners, 2022, ICLR.

[7] Partha P. Talukdar, et al. Re-contextualizing Fairness in NLP: The Case of India, 2022, AACL.

[8] S. Brown, et al. Algorithmic Bias and Risk Assessments: Lessons from Practice, 2022, Digital Society.

[9] Yuhuai Wu, et al. Solving Quantitative Reasoning Problems with Language Models, 2022, NeurIPS.

[10] Lisa Anne Hendricks, et al. Taxonomy of Risks posed by Language Models, 2022, FAccT.

[11] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, Trans. Mach. Learn. Res.

[12] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.

[13] S. Gu, et al. Large Language Models are Zero-Shot Reasoners, 2022, ArXiv.

[14] Eric Michael Smith, et al. “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset, 2022, EMNLP.

[15] Arun Tejasvi Chaganty, et al. Dialog Inpainting: Turning Documents into Dialogs, 2022, ICML.

[16] Z. Chen, et al. Building Machine Translation Systems for the Next Thousand Languages, 2022, ArXiv.

[17] I. Kivlichan, et al. Is Your Toxicity My Toxicity? Exploring the Impact of Rater Identity on Toxicity Annotation, 2022, Proc. ACM Hum. Comput. Interact.

[18] G. Karypis, et al. Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning, 2022, NAACL.

[19] Hyung Won Chung, et al. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?, 2022, ICML.

[20] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.

[21] Andrew Zaldivar, et al. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI, 2022, FAccT.

[22] Marc van Zee, et al. Scaling Up Models and Data with t5x and seqio, 2022, ArXiv.

[23] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, ArXiv.

[24] Noah D. Goodman, et al. STaR: Bootstrapping Reasoning With Reasoning, 2022, ArXiv.

[25] D. Schuurmans, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022, ArXiv.

[26] Swaroop Mishra, et al. How Many Data Samples is an Additional Instruction Worth?, 2022, FINDINGS.

[27] Ryan J. Lowe, et al. Training language models to follow instructions with human feedback, 2022, NeurIPS.

[28] Cherepanov, et al. Competition-level code generation with AlphaCode, 2022, Science.

[29] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, ArXiv.

[30] Tanmoy Chakraborty, et al. Handling Bias in Toxic Speech Detection: A Survey, 2022, ArXiv.

[31] Quoc V. Le, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2021, ICML.

[32] Noah A. Smith, et al. Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection, 2021, NAACL.

[33] M. Lewis, et al. MetaICL: Learning to Learn In Context, 2021, NAACL.

[34] Alexander M. Rush, et al. Multitask Prompted Training Enables Zero-Shot Task Generalization, 2021, ICLR.

[35] Quoc V. Le, et al. Finetuned Language Models Are Zero-Shot Learners, 2021, ICLR.

[36] Kai-Wei Chang, et al. On Measures of Biases and Harms in NLP, 2021, AACL/IJCNLP.

[37] Rami Al-Rfou, et al. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models, 2021, Transactions of the Association for Computational Linguistics.

[38] Hannaneh Hajishirzi, et al. Cross-Task Generalization via Natural Language Crowdsourcing Instructions, 2021, ACL.

[39] Noah A. Smith, et al. Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks, 2022, ArXiv.

[40] Vinh Q. Tran, et al. Unifying Language Learning Paradigms, 2022, ArXiv.

[41] Rachel Rudinger, et al. Recognition of They/Them as Singular Personal Pronouns in Coreference Resolution, 2022, NAACL.

[42] S. Muresan, et al. Continual-T0: Progressively Instructing 50+ Tasks to Language Models Without Forgetting, 2022, ArXiv.

[43] Po-Sen Huang, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021, ArXiv.

[44] David Bieber, et al. Show Your Work: Scratchpads for Intermediate Computation with Language Models, 2021, ArXiv.

[45] Sebastian Gehrmann, et al. SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets, 2021, ArXiv.

[46] Mohammad Bavarian, et al. Training Verifiers to Solve Math Word Problems, 2021, ArXiv.

[47] Eunsol Choi, et al. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge, 2021, NeurIPS Datasets and Benchmarks.

[48] Anaelia Ovalle, et al. Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies, 2021, EMNLP.

[49] Michael S. Bernstein, et al. On the Opportunities and Risks of Foundation Models, 2021, ArXiv.

[50] Wojciech Zaremba, et al. Evaluating Large Language Models Trained on Code, 2021, ArXiv.

[51] Xiang Ren, et al. CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP, 2021, EMNLP.

[52] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.

[53] Dan Klein, et al. Detoxifying Language Models Risks Marginalizing Minority Voices, 2021, NAACL.

[54] Dan Klein, et al. Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections, 2021, EMNLP.

[55] Emily M. Bender, et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.

[56] Timo Schick, et al. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP, 2021, Transactions of the Association for Computational Linguistics.

[57] Sonal Gupta, et al. Muppet: Massive Multi-task Representations with Pre-Finetuning, 2021, EMNLP.

[58] Jonathan Berant, et al. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies, 2021, Transactions of the Association for Computational Linguistics.

[59] Zhucheng Tu, et al. Open-Domain Question Answering Goes Conversational via Question Rewriting, 2020, NAACL.

[60] Eunsol Choi, et al. QED: A Framework and Dataset for Explanations in Question Answering, 2020, Transactions of the Association for Computational Linguistics.

[61] Dawn Song, et al. Measuring Massive Multitask Language Understanding, 2020, ICLR.

[62] Quoc V. Le, et al. Searching for Efficient Transformers for Language Modeling, 2021, NeurIPS.

[63] Dinesh Garg, et al. Explanations for CommonsenseQA: New Dataset and Models, 2021, ACL.

[64] Dana Dannélls, et al. The Swedish Winogender Dataset, 2021, NODALIDA.

[65] Yejin Choi, et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020, FINDINGS.

[66] David Patterson, et al. A domain-specific supercomputer for training deep neural networks, 2020, Commun. ACM.

[67] Jonathan Berant, et al. Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge, 2020, ArXiv.

[68] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[69] Percy Liang, et al. Graph-based, Self-Supervised Program Repair from Diagnostic Feedback, 2020, ICML.

[70] Hannaneh Hajishirzi, et al. UnifiedQA: Crossing Format Boundaries With a Single QA System, 2020, FINDINGS.

[71] Eunsol Choi, et al. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, Transactions of the Association for Computational Linguistics.

[72] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.

[73] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.

[74] Ashish Sabharwal, et al. QASC: A Dataset for Question Answering via Sentence Composition, 2019, AAAI.

[75] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[76] Thomas Lukasiewicz, et al. Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations, 2019, ACL.

[77] Bill Byrne, et al. Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset, 2019, EMNLP.

[78] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[79] Richard Socher, et al. Explain Yourself! Leveraging Language Models for Commonsense Reasoning, 2019, ACL.

[80] Yue Zhang, et al. Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation, 2019, ACL.

[81] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.

[82] Lucy Vasserman, et al. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification, 2019, WWW.

[83] Inioluwa Deborah Raji, et al. Model Cards for Model Reporting, 2018, FAT.

[84] Chelsea Lee. Welcome, Singular "They", 2019, PsycEXTRA Dataset.

[85] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[86] Thomas Lukasiewicz, et al. e-SNLI: Natural Language Inference with Natural Language Explanations, 2018, NeurIPS.

[87] Rachel Rudinger, et al. Gender Bias in Coreference Resolution, 2018, NAACL.

[88] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.

[89] Wang Ling, et al. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems, 2017, ACL.

[90] Philipp Koehn, et al. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016.

[91] Philip A.E. Brey. Anticipatory Ethics for Emerging Technologies, 2012.