Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), highlighted as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks such as LogiQA and ReClor and newly released datasets such as AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks using benchmarks that require logical reasoning. We further construct a logical reasoning out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4, and we compare the performance of the two models. Experimental results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With early access to the GPT-4 API, we are able to conduct intensive experiments on the GPT-4 model. The results show that GPT-4 yields even higher performance on most logical reasoning datasets. Among the benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets such as LogiQA and ReClor. However, performance drops significantly on newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets. We release the prompt-style logical reasoning datasets as a benchmark suite named LogiEval.

Yuexin Zhang | Ruoxi Ning | Zhiyang Teng | Jian Liu | Hanmeng Liu | Qiji Zhou
