How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering