Large Language Models Can Be Easily Distracted by Irrelevant Context

Large language models have achieved impressive performance on a variety of natural language processing tasks. However, they have so far been evaluated primarily on benchmarks where all information in the input context is relevant for solving the task. In this work, we investigate the distractibility of large language models, i.e., how model problem-solving accuracy is affected by irrelevant context. In particular, we introduce Grade-School Math with Irrelevant Context (GSM-IC), an arithmetic reasoning dataset whose problem descriptions contain irrelevant information. We use this benchmark to measure the distractibility of state-of-the-art prompting techniques for large language models, and find that model performance drops dramatically when irrelevant information is included. We also identify several approaches for mitigating this deficiency, such as decoding with self-consistency and adding to the prompt an instruction that tells the language model to ignore the irrelevant information.

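As a rough illustration of the two mitigation strategies mentioned in the abstract, the sketch below combines an ignore-irrelevant-context instruction with self-consistency decoding (majority voting over sampled reasoning paths). The example problem, the instruction wording, and the `SampleFn` callable are illustrative assumptions, not the paper's exact prompts or any specific model API.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical sampling function standing in for any large-language-model API:
# it takes a prompt and returns one sampled chain-of-thought completion as text.
SampleFn = Callable[[str], str]

# A GSM-IC-style problem: a grade-school math question with one added
# irrelevant sentence (here, the clause about Max's aunt).
PROBLEM = (
    "Jessica has 3 boxes with 12 pencils in each box. "
    "Max's aunt is 42 years old. "
    "How many pencils does Jessica have in total?"
)

# Mitigation 1: prepend an instruction telling the model to ignore irrelevant
# information (the wording here is illustrative, not the paper's exact prompt).
INSTRUCTION = (
    "Solve the problem. Ignore any information that is irrelevant "
    "to the question.\n\n"
)

def extract_answer(completion: str) -> Optional[str]:
    """Take the last number in the completion as the predicted answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def self_consistency(sample: SampleFn, prompt: str, n_samples: int = 20) -> Optional[str]:
    """Mitigation 2: sample several reasoning paths and majority-vote the answers."""
    answers = [extract_answer(sample(prompt)) for _ in range(n_samples)]
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None

# Usage, given some model-backed `sample` function:
#   prediction = self_consistency(sample, INSTRUCTION + PROBLEM + "\nLet's think step by step.")
```

The design point is that the two mitigations compose: the instruction reduces how often a sampled reasoning path attends to the distractor sentence, while majority voting filters out the residual paths that still do.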