IsarStep: a Benchmark for High-level Mathematical Reasoning

A well-defined benchmark is essential for measuring and accelerating research progress in machine learning. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has broad coverage of undergraduate and research-level theorems in mathematics and computer science. In the proposed task, a model is required to fill in a missing intermediate proposition given the surrounding proof steps. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline. We will make the dataset and models publicly available.
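To make the task concrete, here is a small, hypothetical illustration of the fill-in-the-gap setup. The proof fragment, the `<MASK>` convention, and the field names are assumptions made for exposition, not the dataset's actual encoding.

```python
# Hypothetical IsarStep-style example (the markup and field names are
# illustrative assumptions, not the dataset's real format). The model reads
# the surrounding proof steps and must generate the masked proposition.
example = {
    "source": [
        "assume odd n",
        "then obtain k where n = 2 * k + 1",
        "<MASK>",                              # the missing intermediate step
        "then show even (n + 1)",
    ],
    "target": "then have n + 1 = 2 * (k + 1)",
}

def exact_match(prediction: str, reference: str) -> bool:
    """Top-1 exact match after whitespace normalisation, one plausible way
    to judge a generated proposition against the human-written one."""
    return " ".join(prediction.split()) == " ".join(reference.split())

print(exact_match("then have n + 1 = 2 * (k + 1)", example["target"]))  # True
```

The hierarchical transformer can likewise be sketched at a high level: one encoder attends over the tokens within each proposition, and a second encoder attends across the resulting proposition representations. The sketch below is a minimal PyTorch rendering of that two-level idea under assumed hyperparameters, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Two-level encoder sketch: token-level attention within each
    proposition, then proposition-level attention across the proof context."""

    def __init__(self, vocab_size: int, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.local = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.context = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_propositions, tokens_per_proposition) integer ids
        b, n, l = tokens.shape
        x = self.embed(tokens.view(b * n, l))  # (b*n, l, d_model)
        x = self.local(x)                      # attention within a proposition
        props = x.mean(dim=1).view(b, n, -1)   # pool each proposition to a vector
        return self.context(props)             # attention across propositions

enc = HierarchicalEncoder(vocab_size=1000)
out = enc(torch.randint(0, 1000, (2, 5, 12)))  # 2 proofs, 5 propositions, 12 tokens
print(out.shape)                               # torch.Size([2, 5, 256])
```

A flat transformer would concatenate all propositions into one long token sequence; the hierarchical variant instead gives the model an explicit notion of proposition boundaries, which is a plausible reason it helps on this task.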
