IsarStep: a Benchmark for High-level Mathematical Reasoning

A well-defined benchmark is essential for measuring and accelerating research progress in machine learning. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has broad coverage of undergraduate and research-level theorems in mathematics and computer science. In the proposed task, a model is required to fill in a missing intermediate proposition given the surrounding proof steps. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline. We will make the dataset and models publicly available.
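To make the task concrete, here is a small, hypothetical illustration of the fill-in-the-gap setup. The proof fragment, the `<MASK>` convention, and the field names are assumptions made for exposition, not the dataset's actual encoding.

```python
# Hypothetical IsarStep-style example (the markup and field names are
# illustrative assumptions, not the dataset's real format). The model reads
# the surrounding proof steps and must generate the masked proposition.
example = {
    "source": [
        "assume odd n",
        "then obtain k where n = 2 * k + 1",
        "<MASK>",                              # the missing intermediate step
        "then show even (n + 1)",
    ],
    "target": "then have n + 1 = 2 * (k + 1)",
}

def exact_match(prediction: str, reference: str) -> bool:
    """Top-1 exact match after whitespace normalisation, one plausible way
    to judge a generated proposition against the human-written one."""
    return " ".join(prediction.split()) == " ".join(reference.split())

print(exact_match("then have n + 1 = 2 * (k + 1)", example["target"]))  # True
```

The hierarchical transformer can likewise be sketched at a high level: one encoder attends over the tokens within each proposition, and a second encoder attends across the resulting proposition representations. The sketch below is a minimal PyTorch rendering of that two-level idea under assumed hyperparameters, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Two-level encoder sketch: token-level attention within each
    proposition, then proposition-level attention across the proof context."""

    def __init__(self, vocab_size: int, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.local = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.context = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_propositions, tokens_per_proposition) integer ids
        b, n, l = tokens.shape
        x = self.embed(tokens.view(b * n, l))  # (b*n, l, d_model)
        x = self.local(x)                      # attention within a proposition
        props = x.mean(dim=1).view(b, n, -1)   # pool each proposition to a vector
        return self.context(props)             # attention across propositions

enc = HierarchicalEncoder(vocab_size=1000)
out = enc(torch.randint(0, 1000, (2, 5, 12)))  # 2 proofs, 5 propositions, 12 tokens
print(out.shape)                               # torch.Size([2, 5, 256])
```

A flat transformer would concatenate all propositions into one long token sequence; the hierarchical variant instead gives the model an explicit notion of proposition boundaries, which is a plausible reason it helps on this task.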
