Leveraging Large Language Models for Multiple Choice Question Answering

While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in the zero, one, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs as cloze tasks: the LLM is conditioned on a question (without the associated answer options), and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., "A") associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational costs, and mitigates the effects of tokenization scheme and answer option representations on answer selection. For the natural approach to be effective, the LLM it is used with must be able to associate answer options with the symbols that represent them. The LLM needs what we term multiple choice symbol binding (MCSB) ability. This ability varies greatly by model. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.
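
To make the contrast concrete, the minimal sketch below scores a toy question both ways, using GPT-2 through Hugging Face Transformers purely as a stand-in scorer (the paper evaluates far larger models). The prompt templates, the per-token length normalization, and the example question are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`.
    Assumes the prompt/continuation boundary survives tokenization,
    which is close enough for this illustration."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Logits at position pos-1 predict the token at position pos.
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

question = "What is the capital of France?"
options = ["Berlin", "Paris", "Madrid", "Rome"]
symbols = ["A", "B", "C", "D"]

# Traditional cloze-style scoring: condition on the question alone and pick the
# option whose text receives the highest length-normalized log-probability.
cloze_prompt = f"Question: {question}\nAnswer:"
cloze_scores = [
    continuation_logprob(cloze_prompt, f" {opt}")
    / len(tokenizer(f" {opt}").input_ids)
    for opt in options
]
print("Cloze pick:", options[cloze_scores.index(max(cloze_scores))])

# Multiple choice prompting: show every option next to a symbol and compare the
# probability the model assigns to each symbol as the next token.
mcp_prompt = (
    f"Question: {question}\n"
    + "\n".join(f"{s}. {opt}" for s, opt in zip(symbols, options))
    + "\nAnswer:"
)
mcp_scores = [continuation_logprob(mcp_prompt, f" {s}") for s in symbols]
print("MCP pick:", options[mcp_scores.index(max(mcp_scores))])
```

Note that the second strategy scores only a single symbol token per option, which is why it is cheaper and insensitive to how the answer text itself happens to be tokenized; it works only insofar as the model can bind each symbol to the option it labels.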
