Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), highlighted as"advanced"at reasoning tasks, we are eager to learn the GPT-4 performance on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, with popular benchmarks like LogiQA and ReClor, and newly-released datasets like AR-LSAT. We test the multi-choice reading comprehension and natural language inference tasks with benchmarks requiring logical reasoning. We further construct a logical reasoning out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4. We also make a performance comparison between ChatGPT and GPT-4. Experiment results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With early access to the GPT-4 API we are able to conduct intense experiments on the GPT-4 model. The results show GPT-4 yields even higher performance on most logical reasoning datasets. Among benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets like LogiQA and ReClor. However, the performance drops significantly when handling newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets. We release the prompt-style logical reasoning datasets as a benchmark suite and name it LogiEval.

[1]  Dan Su,et al.  A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity , 2023, IJCNLP.

[2]  Michihiro Yasunaga,et al.  Is ChatGPT a General-Purpose Natural Language Processing Task Solver? , 2023, EMNLP.

[3]  S. Gu,et al.  Large Language Models are Zero-Shot Reasoners , 2022, NeurIPS.

[4]  Nan Duan,et al.  From LSAT: The Progress and Challenges of Complex Reasoning , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5]  Yue Zhang,et al.  Natural Language Inference in Context - Investigating Contextual Reasoning over Long Texts , 2020, AAAI.

[6]  Mohit Bansal,et al.  ConjNLI: Natural Language Inference over Conjunctive Sentences , 2020, EMNLP.

[7]  M. Choudhury,et al.  TaxiNLI: Taking a Ride up the NLU Hill , 2020, CONLL.

[8]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[9]  Colin Raffel,et al.  Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[10]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[11]  Johan Bos,et al.  Can Neural Networks Understand Monotonicity Reasoning? , 2019, BlackboxNLP@ACL.

[12]  Vivek Srikumar,et al.  Augmenting Neural Networks with First-order Logic , 2019, ACL.

[13]  Johan Bos,et al.  HELP: A Dataset for Identifying Shortcomings of Neural Models in Monotonicity Reasoning , 2019, *SEMEVAL.

[14]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[15]  Omer Levy,et al.  GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.

[16]  Samuel R. Bowman,et al.  A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference , 2017, NAACL.

[17]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[18]  Christopher Potts,et al.  A large annotated corpus for learning natural language inference , 2015, EMNLP.

[19]  Christopher D. Manning,et al.  Natural Logic for Textual Inference , 2007, ACL-PASCAL@ACL.

[20]  Lucja Iwanska,et al.  Logical reasoning in natural language: It is all about knowledge , 1993, Minds and Machines.

[21]  Robert A. Kowalski,et al.  Logic for problem solving , 1982, The computer science library : Artificial intelligence series.