Prompting GPT-3 To Be Reliable
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan L. Boyd-Graber, Lijuan Wang
[1] Tom B. Brown, et al. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned, 2022, ArXiv.
[2] Dragomir R. Radev, et al. RealTime QA: What's the Answer Right Now?, 2022, NeurIPS.
[3] A. Bruckman, et al. Exploring the Role of Grammar and Word Choice in Bias Toward African American English (AAE) in Hate Speech Classification, 2022, FAccT.
[4] J. Dean, et al. Emergent Abilities of Large Language Models, 2022, ArXiv.
[5] Christopher D. Manning, et al. Memory-Based Model Editing at Scale, 2022, ICML.
[6] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.
[7] Jordan L. Boyd-Graber, et al. Revisiting Calibration for Question Answering, 2022, ArXiv.
[8] S. Gu, et al. Large Language Models are Zero-Shot Reasoners, 2022, ArXiv.
[9] Yang Trista Cao, et al. On the Intrinsic and Extrinsic Fairness Evaluation Metrics for Contextualized Language Representations, 2022, ACL.
[10] D. Schuurmans, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022, ArXiv.
[11] M. Lewis, et al. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?, 2022, EMNLP.
[12] Nicholas Carlini, et al. Quantifying Memorization Across Neural Language Models, 2022, ArXiv.
[13] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, ArXiv.
[14] Jonathan Berant, et al. Unobserved Local Structures Make Compositional Generalization Hard, 2022.
[15] Christopher D. Manning, et al. Fast Model Editing at Scale, 2021, ICLR.
[16] Phu Mon Htut, et al. BBQ: A hand-built bias benchmark for question answering, 2021, Findings.
[17] Greg Durrett, et al. Can Explanations Be Useful for Calibrating Black Box Models?, 2021, ACL.
[18] Owain Evans, et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods, 2021, ACL.
[19] Edouard Grave, et al. Unsupervised Dense Information Retrieval with Contrastive Learning, 2021, TMLR.
[20] Po-Sen Huang, et al. Ethical and social risks of harm from Language Models, 2021, ArXiv.
[21] Zhe Gan, et al. Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models, 2021, NeurIPS Datasets and Benchmarks.
[22] Dan Friedman, et al. Single-dataset Experts for Multi-dataset Question Answering, 2021, EMNLP.
[23] Udit Arora, et al. Types of Out-of-Distribution Texts and How to Detect Them, 2021, EMNLP.
[24] Nikhil Ramesh, et al. Entity-Based Knowledge Conflicts in Question Answering, 2021, EMNLP.
[25] Wenhu Chen, et al. A Dataset for Answering Time-Sensitive Questions, 2021, NeurIPS Datasets and Benchmarks.
[26] Wojciech Zaremba, et al. Evaluating Large Language Models Trained on Code, 2021, ArXiv.
[27] Kai-Wei Chang, et al. Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?, 2021, Findings.
[28] Li Lucy, et al. Gender and Representation Bias in GPT-3 Generated Stories, 2021, NUSE.
[29] Brian Lester, et al. The Power of Scale for Parameter-Efficient Prompt Tuning, 2021, EMNLP.
[30] Nicola De Cao, et al. Editing Factual Knowledge in Language Models, 2021, EMNLP.
[31] Dawn Song, et al. Measuring Mathematical Problem Solving With the MATH Dataset, 2021, NeurIPS Datasets and Benchmarks.
[32] Emily M. Bender, et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.
[33] D. Klein, et al. Calibrate Before Use: Improving Few-Shot Performance of Language Models, 2021, ICML.
[34] Colin Raffel, et al. Extracting Training Data from Large Language Models, 2020, USENIX Security Symposium.
[35] Pang Wei Koh, et al. WILDS: A Benchmark of in-the-Wild Distribution Shifts, 2020, ICML.
[36] Graham Neubig, et al. How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering, 2020, TACL.
[37] Nicola De Cao, et al. KILT: a Benchmark for Knowledge Intensive Language Tasks, 2020, NAACL.
[38] Ting Liu, et al. Benchmarking Robustness of Machine Reading Comprehension Models, 2020, Findings.
[39] Siva Reddy, et al. StereoSet: Measuring stereotypical bias in pretrained language models, 2020, ACL.
[40] Hwee Tou Ng, et al. Do Multi-Hop Question Answering Systems Know How to Answer the Single-Hop Sub-Questions?, 2020, EACL.
[41] Maosong Sun, et al. Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning, 2020, ArXiv.
[42] Tal Linzen, et al. COGS: A Compositional Generalization Challenge Based on Semantic Interpretation, 2020, EMNLP.
[43] Samuel R. Bowman, et al. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models, 2020, EMNLP.
[44] Yejin Choi, et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020, Findings.
[45] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[46] Shafiq R. Joty, et al. It’s Morphin’ Time! Combating Linguistic Discrimination with Inflectional Perturbations, 2020, ACL.
[47] Sameer Singh, et al. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, 2020, ACL.
[48] Dawn Song, et al. Pretrained Transformers Improve Out-of-Distribution Robustness, 2020, ACL.
[49] Danqi Chen, et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.
[50] Noah A. Smith, et al. Evaluating Models’ Local Decision Boundaries via Contrast Sets, 2020, Findings.
[51] Xipeng Qiu, et al. BERT-ATTACK: Adversarial Attack against BERT Using BERT, 2020, EMNLP.
[52] Shrey Desai, et al. Calibration of Pre-trained Transformers, 2020, EMNLP.
[53] Tomohide Shibata. Understand in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.
[54] Colin Raffel, et al. How Much Knowledge Can You Pack into the Parameters of a Language Model?, 2020, EMNLP.
[55] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[56] Xiao Wang, et al. Measuring Compositional Generalization: A Comprehensive Method on Realistic Data, 2019, ICLR.
[57] Joey Tianyi Zhou, et al. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment, 2019, AAAI.
[58] Zhucheng Tu, et al. An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering, 2019, EMNLP.
[59] Danqi Chen, et al. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension, 2019, EMNLP.
[60] Sebastian Riedel, et al. Language Models as Knowledge Bases?, 2019, EMNLP.
[61] Sameer Singh, et al. Universal Adversarial Triggers for Attacking and Analyzing NLP, 2019, EMNLP.
[62] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[63] Jonathan Berant, et al. MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension, 2019, ACL.
[64] Jason Baldridge, et al. PAWS: Paraphrase Adversaries from Word Scrambling, 2019, NAACL.
[65] R. Thomas McCoy, et al. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, 2019, ACL.
[66] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[67] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.
[68] Carlos Guestrin, et al. Semantically Equivalent Adversarial Rules for Debugging NLP models, 2018, ACL.
[69] Rachel Rudinger, et al. Hypothesis Only Baselines in Natural Language Inference, 2018, *SEM.
[70] Rachel Rudinger, et al. Gender Bias in Coreference Resolution, 2018, NAACL.
[71] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[72] Jieyu Zhao, et al. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods, 2018, NAACL.
[73] Andreas Vlachos, et al. FEVER: a Large-scale Dataset for Fact Extraction and VERification, 2018, NAACL.
[74] Omer Levy, et al. Annotation Artifacts in Natural Language Inference Data, 2018, NAACL.
[75] Percy Liang, et al. Adversarial Examples for Evaluating Reading Comprehension Systems, 2017, EMNLP.
[76] Luke S. Zettlemoyer, et al. End-to-end Neural Coreference Resolution, 2017, EMNLP.
[77] Kilian Q. Weinberger, et al. On Calibration of Modern Neural Networks, 2017, ICML.
[78] Omer Levy, et al. Zero-Shot Relation Extraction via Reading Comprehension, 2017, CoNLL.
[79] John Schulman, et al. Concrete Problems in AI Safety, 2016, ArXiv.
[80] Milos Hauskrecht, et al. Obtaining Well Calibrated Probabilities Using Bayesian Binning, 2015, AAAI.
[81] John Platt, et al. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, 1999.
[82] G. Brier. Verification of Forecasts Expressed in Terms of Probability, 1950.