INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed specifically to test agents' generalization ability. INT is built on a procedure for generating theorems and proofs; the procedure's knobs allow us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic of automated theorem proving. In addition, unlike prior benchmarks for learning-assisted theorem proving, INT provides a lightweight and user-friendly theorem proving environment with fast simulations, conducive to learning-based and search-based research. We introduce learning-based baselines and evaluate them on the benchmark across the 6 dimensions of generalization. We then evaluate the same agents augmented with Monte Carlo Tree Search (MCTS) at test time, and show that MCTS can help to prove new theorems.
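To make the idea of a knob-controlled generator concrete, below is a minimal sketch, not the authors' code, of how theorems could be synthesized by chaining axiom applications. The axiom names, the apply_axiom helper, and the knob names (num_axioms, proof_length) are illustrative assumptions rather than the INT API; they only show how varying such knobs between training and test time could probe generalization.

```python
# Minimal sketch (assumed interface, not the INT implementation) of a
# knob-controlled theorem generator that chains axiom applications.
import random
from dataclasses import dataclass, field

@dataclass
class GeneratedTheorem:
    statement: str                               # final goal to prove
    proof: list = field(default_factory=list)    # sequence of (axiom, resulting expression)

# Assumed axiom names for illustration only.
AXIOMS = ["AMGMInequality", "SquareNonNegative", "AdditionCommutativity"]

def apply_axiom(axiom: str, expr: str) -> str:
    """Placeholder: rewrite `expr` by one forward application of `axiom`."""
    return f"{axiom}({expr})"

def generate_theorem(num_axioms: int, proof_length: int, seed: int = 0) -> GeneratedTheorem:
    """Chain `proof_length` axiom applications drawn from `num_axioms` distinct axioms.

    The two arguments play the role of the generator's knobs: training on small
    values and testing on larger ones is one way to probe generalization.
    """
    rng = random.Random(seed)
    axioms = rng.sample(AXIOMS, k=min(num_axioms, len(AXIOMS)))
    expr = "a + b"                               # assumed initial expression
    thm = GeneratedTheorem(statement=expr)
    for _ in range(proof_length):
        ax = rng.choice(axioms)
        expr = apply_axiom(ax, expr)
        thm.proof.append((ax, expr))
    thm.statement = expr
    return thm

if __name__ == "__main__":
    print(generate_theorem(num_axioms=2, proof_length=3).statement)
```

Under this reading, axes such as "more distinct axioms than seen in training" or "longer proofs than seen in training" correspond to evaluating agents at knob settings outside the training distribution.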
