Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Logical reasoning consistently plays a fundamental and significant role in the domains of knowledge engineering and artificial intelligence. Recently, Large Language Models (LLMs) have emerged as a noteworthy innovation in natural language processing (NLP), exhibiting impressive achievements across various classic NLP tasks. However, the question of whether LLMs can effectively address the task of logical reasoning, which requires step-by-step cognitive inference akin to human intelligence, remains unanswered. To bridge this gap, we provide comprehensive evaluations in this paper. First, to offer systematic evaluations, we select fifteen typical logical reasoning datasets and organize them into deductive, inductive, abductive, and mixed-form reasoning settings. For comprehensiveness, we evaluate three representative LLMs (i.e., text-davinci-003, ChatGPT, and BARD) on all selected datasets under zero-shot, one-shot, and three-shot settings. Second, unlike previous evaluations that rely solely on simple metrics (e.g., accuracy), we propose fine-grained evaluations from both objective and subjective perspectives, covering both answers and explanations. Additionally, to uncover the logical flaws of LLMs, problematic cases are attributed to five error types along two dimensions, i.e., the evidence selection process and the reasoning process. Third, to avoid the influence of knowledge bias and focus purely on benchmarking the logical reasoning capability of LLMs, we propose a new dataset with neutral content. It contains 3,000 samples and covers deductive, inductive, and abductive settings. Based on these in-depth evaluations, this paper finally forms a general evaluation scheme of logical reasoning capability across six dimensions. It reflects the pros and cons of LLMs and offers guiding directions for future work.
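
For concreteness, the zero-/few-shot evaluation protocol described above can be sketched as follows. This is a minimal illustration only, not the paper's released code: the `Sample` fields, `build_prompt`, and the `query_llm` callback are hypothetical placeholders standing in for a dataset loader and an actual model API, and only plain answer accuracy is computed, whereas the paper's fine-grained metrics additionally score the generated explanations.

```python
# Minimal sketch of a k-shot evaluation loop (k = 0, 1, or 3 in the paper).
# All names here are hypothetical placeholders, not the authors' code.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Sample:
    context: str   # premises / observations given to the model
    question: str  # the reasoning query
    answer: str    # gold label, e.g. "True" / "False" or an option letter


def build_prompt(test: Sample, demos: Sequence[Sample]) -> str:
    """Prepend k in-context demonstrations before the test instance."""
    parts = [
        f"Context: {d.context}\nQuestion: {d.question}\nAnswer: {d.answer}"
        for d in demos
    ]
    parts.append(f"Context: {test.context}\nQuestion: {test.question}\nAnswer:")
    return "\n\n".join(parts)


def evaluate(dataset: Sequence[Sample],
             demos: Sequence[Sample],
             query_llm: Callable[[str], str]) -> float:
    """Answer accuracy; finer-grained, explanation-level scoring is omitted."""
    correct = 0
    for sample in dataset:
        prediction = query_llm(build_prompt(sample, demos))
        # Lenient match: accept a response that begins with the gold label.
        correct += prediction.strip().lower().startswith(sample.answer.strip().lower())
    return correct / len(dataset)
```

In this sketch, `query_llm` would wrap whichever model endpoint is being benchmarked (e.g., text-davinci-003, ChatGPT, or BARD), so the same loop covers all three models and all shot settings by varying `demos`.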
