Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability

Investigating the reasoning abilities of transformer models, and discovering new challenging tasks for them, has been a topic of much interest. Recent studies have found these models to be surprisingly strong at performing deductive reasoning over formal logical theories expressed in natural language. A shortcoming of these studies, however, is that they do not take into account that logical theories, when sampled uniformly at random, do not necessarily lead to hard instances. We propose a new methodology for creating challenging algorithmic reasoning datasets that focus on natural language satisfiability (NLSat) problems. The key idea is to draw insights from empirical sampling of hard propositional SAT problems and from complexity-theoretic studies of language. This methodology allows us to distinguish easy from hard instances, and to systematically increase the complexity of existing reasoning benchmarks such as RuleTaker. We find that current transformers, given sufficient training data, are surprisingly robust at solving the resulting NLSat problems of substantially increased difficulty. They also exhibit some degree of scale-invariance—the ability to generalize to problems of larger size and scope. Our results, however, also reveal important limitations: careful sampling of training data is crucial for building models that generalize to larger problems, and transformer models’ limited scale-invariance suggests they are far from learning robust deductive reasoning algorithms.
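To make the core idea concrete, the sketch below shows one way hard-instance sampling from the random SAT literature can be paired with a natural-language rendering. This is a minimal illustration, not the paper's actual sampler or grammar: the helper names (`sample_random_ksat`, `brute_force_sat`, `clause_to_english`) and the "person is happy" verbalization template are hypothetical, while the clause-to-variable ratio of roughly 4.26 is the empirically observed hard region for random 3-SAT.

```python
import itertools
import random


def sample_random_ksat(num_vars, num_clauses, k=3, seed=0):
    """Sample a random k-CNF formula as a list of clauses.

    Each clause is a tuple of signed variable indices (positive = plain literal,
    negative = negated literal). Near the satisfiability phase transition
    (clause/variable ratio of roughly 4.26 for k=3), random instances are
    empirically hardest for SAT solvers.
    """
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        chosen = rng.sample(range(1, num_vars + 1), k)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in chosen))
    return clauses


def brute_force_sat(clauses, num_vars):
    """Check satisfiability by exhaustive enumeration (only viable for small num_vars)."""
    for bits in itertools.product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause) for clause in clauses):
            return True
    return False


def clause_to_english(clause, names):
    """Render one clause as a natural-language rule using a toy template."""
    literals = [
        f"{names[abs(lit) - 1]} is {'happy' if lit > 0 else 'not happy'}" for lit in clause
    ]
    return "At least one of the following holds: " + "; ".join(literals) + "."


if __name__ == "__main__":
    n = 12
    m = round(4.26 * n)  # clause/variable ratio near the hard region
    names = [f"person{i}" for i in range(1, n + 1)]
    formula = sample_random_ksat(n, m)
    print(clause_to_english(formula[0], names))
    print("satisfiable:", brute_force_sat(formula, n))
```

In this view, an NLSat example is the verbalized formula paired with a satisfiable/unsatisfiable label, and controlling the clause-to-variable ratio is what separates easy instances from hard ones.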
