Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.
