Prompting GPT-3 To Be Reliable
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan L. Boyd-Graber, Lijuan Wang
[1] Tom B. Brown, et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, 2022, ArXiv.
[2] Dragomir R. Radev, et al. RealTime QA: What's the Answer Right Now?, 2022, NeurIPS.
[3] A. Bruckman, et al. Exploring the Role of Grammar and Word Choice in Bias Toward African American English (AAE) in Hate Speech Classification, 2022, FAccT.
[4] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, ArXiv.
[5] Christopher D. Manning, et al. Memory-Based Model Editing at Scale, 2022, ICML.
[6] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.
[7] Jordan L. Boyd-Graber, et al. Revisiting Calibration for Question Answering, 2022, ArXiv.
[8] S. Gu, et al. Large Language Models are Zero-Shot Reasoners, 2022, ArXiv.
[9] Yang Trista Cao, et al. On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations, 2022, ACL.
[10] D. Schuurmans, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022, ArXiv.
[11] M. Lewis, et al. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, 2022, EMNLP.
[12] Nicholas Carlini, et al. Quantifying Memorization Across Neural Language Models, 2022, ArXiv.
[13] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, ArXiv.
[14] Jonathan Berant, et al. Unobserved Local Structures Make Compositional Generalization Hard, 2022.
[15] Christopher D. Manning, et al. Fast Model Editing at Scale, 2021, ICLR.
[16] Phu Mon Htut, et al. BBQ: A hand-built bias benchmark for question answering, 2021, Findings.
[17] Greg Durrett, et al. Can Explanations Be Useful for Calibrating Black Box Models?, 2021, ACL.
[18] Owain Evans, et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods, 2021, ACL.
[19] Edouard Grave, et al. Unsupervised Dense Information Retrieval with Contrastive Learning, 2021, TMLR.
[20] Po-Sen Huang, et al. Ethical and social risks of harm from Language Models, 2021, ArXiv.
[21] Zhe Gan, et al. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models, 2021, NeurIPS Datasets and Benchmarks.
[22] Dan Friedman, et al. Single-dataset Experts for Multi-dataset Question Answering, 2021, EMNLP.
[23] Udit Arora, et al. Types of Out-of-Distribution Texts and How to Detect Them, 2021, EMNLP.
[24] Nikhil Ramesh, et al. Entity-Based Knowledge Conflicts in Question Answering, 2021, EMNLP.
[25] Wenhu Chen, et al. A Dataset for Answering Time-Sensitive Questions, 2021, NeurIPS Datasets and Benchmarks.
[26] Wojciech Zaremba, et al. Evaluating Large Language Models Trained on Code, 2021, ArXiv.
[27] Kai-Wei Chang, et al. Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?, 2021, Findings.
[28] Li Lucy, et al. Gender and Representation Bias in GPT-3 Generated Stories, 2021, NUSE.
[29] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.
[30] Nicola De Cao, et al. Editing Factual Knowledge in Language Models, 2021, EMNLP.
[31] Dawn Song, et al. Measuring Mathematical Problem Solving With the MATH Dataset, 2021, NeurIPS Datasets and Benchmarks.
[32] Emily M. Bender, et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.
[33] D. Klein, et al. Calibrate Before Use: Improving Few-Shot Performance of Language Models, 2021, ICML.
[34] Colin Raffel, et al. Extracting Training Data from Large Language Models, 2020, USENIX Security Symposium.
[35] Pang Wei Koh, et al. WILDS: A Benchmark of in-the-Wild Distribution Shifts, 2020, ICML.
[36] Graham Neubig, et al. How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering, 2020, TACL.
[37] Nicola De Cao, et al. KILT: a Benchmark for Knowledge Intensive Language Tasks, 2020, NAACL.
[38] Ting Liu, et al. Benchmarking Robustness of Machine Reading Comprehension Models, 2020, Findings.
[39] Siva Reddy, et al. StereoSet: Measuring stereotypical bias in pretrained language models, 2020, ACL.
[40] Hwee Tou Ng, et al. Do Multi-Hop Question Answering Systems Know How to Answer the Single-Hop Sub-Questions?, 2020, EACL.
[41] Maosong Sun, et al. Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning, 2020, ArXiv.
[42] Tal Linzen, et al. COGS: A Compositional Generalization Challenge Based on Semantic Interpretation, 2020, EMNLP.
[43] Samuel R. Bowman, et al. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models, 2020, EMNLP.
[44] Yejin Choi, et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020, Findings.
[45] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[46] Shafiq R. Joty, et al. It’s Morphin’ Time! Combating Linguistic Discrimination with Inflectional Perturbations, 2020, ACL.
[47] Sameer Singh, et al. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, 2020, ACL.
[48] Dawn Song, et al. Pretrained Transformers Improve Out-of-Distribution Robustness, 2020, ACL.
[49] Danqi Chen, et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.
[50] Noah A. Smith, et al. Evaluating Models’ Local Decision Boundaries via Contrast Sets, 2020, Findings.
[51] Xipeng Qiu, et al. BERT-ATTACK: Adversarial Attack against BERT Using BERT, 2020, EMNLP.
[52] Shrey Desai, et al. Calibration of Pre-trained Transformers, 2020, EMNLP.
[53] Tomohide Shibata. Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.
[54] Colin Raffel, et al. How Much Knowledge Can You Pack into the Parameters of a Language Model?, 2020, EMNLP.
[55] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[56] Xiao Wang, et al. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data, 2019, ICLR.
[57] Joey Tianyi Zhou, et al. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment, 2019, AAAI.
[58] Zhucheng Tu, et al. An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering, 2019, EMNLP.
[59] Danqi Chen, et al. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension, 2019, EMNLP.
[60] Sebastian Riedel, et al. Language Models as Knowledge Bases?, 2019, EMNLP.
[61] Sameer Singh, et al. Universal Adversarial Triggers for Attacking and Analyzing NLP, 2019, EMNLP.
[62] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[63] Jonathan Berant, et al. MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension, 2019, ACL.
[64] Jason Baldridge, et al. PAWS: Paraphrase Adversaries from Word Scrambling, 2019, NAACL.
[65] R. Thomas McCoy, et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, 2019, ACL.
[66] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[67] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.
[68] Carlos Guestrin, et al. Semantically Equivalent Adversarial Rules for Debugging NLP models, 2018, ACL.
[69] Rachel Rudinger, et al. Hypothesis Only Baselines in Natural Language Inference, 2018, *SEM.
[70] Rachel Rudinger, et al. Gender Bias in Coreference Resolution, 2018, NAACL.
[71] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[72] Jieyu Zhao, et al. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods, 2018, NAACL.
[73] Andreas Vlachos, et al. FEVER: a Large-scale Dataset for Fact Extraction and VERification, 2018, NAACL.
[74] Omer Levy, et al. Annotation Artifacts in Natural Language Inference Data, 2018, NAACL.
[75] Percy Liang, et al. Adversarial Examples for Evaluating Reading Comprehension Systems, 2017, EMNLP.
[76] Luke S. Zettlemoyer, et al. End-to-end Neural Coreference Resolution, 2017, EMNLP.
[77] Kilian Q. Weinberger, et al. On Calibration of Modern Neural Networks, 2017, ICML.
[78] Omer Levy, et al. Zero-Shot Relation Extraction via Reading Comprehension, 2017, CoNLL.
[79] John Schulman, et al. Concrete Problems in AI Safety, 2016, ArXiv.
[80] Milos Hauskrecht, et al. Obtaining Well Calibrated Probabilities Using Bayesian Binning, 2015, AAAI.
[81] John Platt, et al. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, 1999.
[82] G. Brier. Verification of Forecasts Expressed in Terms of Probability, 1950.