Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right

Large language models have shown promising results in zero-shot settings. For example, they can perform multiple-choice tasks simply by conditioning on a question and selecting the answer with the highest probability. However, ranking by string probability can be problematic because of surface form competition: different surface forms compete for probability mass even when they represent the same underlying concept in a given context, e.g. "computer" and "PC." Since probability mass is finite, competition from other strings that are valid answers (but are not among the multiple-choice options) lowers the probability of the correct answer. We introduce Domain Conditional Pointwise Mutual Information (PMI_DC), an alternative scoring function that directly compensates for surface form competition by reweighting each option according to its a priori likelihood within the context of a specific task. It achieves consistent gains in zero-shot performance over both calibrated and uncalibrated scoring functions, across all GPT-2 and GPT-3 model sizes, on a variety of multiple-choice datasets.
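To make the scoring concrete, PMI_DC ranks each option y for a question x by log P(y | x) - log P(y | x_domain), where x_domain is a short task-specific "domain premise" that estimates how likely the option string is a priori within the task format. Below is a minimal sketch of this scoring with GPT-2 via the HuggingFace transformers library; the domain premise ("A:"), the example question, and the option strings are illustrative placeholders, not the paper's actual prompts.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def answer_log_prob(context: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens, conditioned on context."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    first = context_ids.shape[1] - 1  # logit index predicting the first answer token
    last = input_ids.shape[1] - 1     # one past the last answer-token logit
    return sum(log_probs[pos, input_ids[0, pos + 1]].item()
               for pos in range(first, last))

def pmi_dc_score(question: str, option: str, domain_premise: str) -> float:
    # log P(option | question) - log P(option | domain premise)
    return answer_log_prob(question, option) - answer_log_prob(domain_premise, option)

# Hypothetical usage; options carry a leading space so GPT-2's BPE
# tokenization matches how they would appear as a continuation.
question = "Q: What is another name for a personal computer?\nA:"
domain_premise = "A:"
options = [" PC", " a fruit", " a bicycle"]
print(max(options, key=lambda o: pmi_dc_score(question, o, domain_premise)))

Because the domain-conditional term log P(option | x_domain) measures how probable an option is under the task format regardless of the question, subtracting it removes much of the surface-form prior that plain string probability conflates with answer quality.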
