Prompting GPT-3 To Be Reliable

Large language models (LLMs) show impressive abilities via few-shot prompting. Commercial APIs such as OpenAI's GPT-3 have further expanded their use in real-world language applications. However, existing research focuses on models' accuracy on standard benchmarks and largely ignores their reliability, which is crucial for avoiding catastrophic real-world harms. While reliability is a broad and vaguely defined term, this work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We establish simple and effective prompts to demonstrate GPT-3's reliability in these four aspects: 1) generalizing out of domain, 2) balancing demographic distributions to reduce social biases, 3) calibrating language model probabilities, and 4) updating the LLM's knowledge. We find that, with appropriate prompts, GPT-3 outperforms smaller-scale supervised models by large margins on all four facets. We release all processed datasets, evaluation scripts, and model predictions to facilitate future analysis. Our findings not only offer new insights into the reliability of prompting LLMs, but, more importantly, our prompting strategies can help practitioners use large language models like GPT-3 more reliably.
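To make the calibration facet concrete, the sketch below shows how few-shot classification confidences can be derived from per-option log-probabilities (as returned by an LLM API) and then scored with expected calibration error. This is a minimal illustration under our own assumptions, not the paper's exact implementation; the function names and the equal-width binning scheme are illustrative.

```python
import math

def option_confidences(option_logprobs):
    """Convert per-option log-probabilities (e.g., summed token log-probs
    returned by an LLM API) into a normalized confidence distribution."""
    m = max(option_logprobs)
    exps = [math.exp(lp - m) for lp in option_logprobs]
    z = sum(exps)
    return [e / z for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence and average the gap
    between mean confidence and empirical accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1.0 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# Example: three multiple-choice predictions (hypothetical log-probs) and their correctness labels.
confs = [max(option_confidences(lp)) for lp in [[-0.1, -2.3], [-1.2, -0.4], [-0.05, -3.0]]]
print(expected_calibration_error(confs, [True, False, True], n_bins=5))
```

A lower ECE indicates that the model's stated confidence better matches how often it is actually correct, which is the property the calibration facet evaluates.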
