Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2

Thinking aloud is an effective meta-cognitive strategy that human reasoners apply to solve difficult problems. We suggest improving the reasoning ability of pre-trained neural language models in a similar way: by expanding a task's context with problem elaborations that the language model itself generates dynamically. Our main result is that dynamic problem elaboration significantly improves the zero-shot performance of GPT-2 on a deductive reasoning and natural language inference task: while the model relies on a syntactic heuristic to predict an answer, it is capable, to some degree, of generating reasoned additional context that facilitates the successful application of that heuristic. We explore different ways of generating elaborations, including few-shot learning, and find that their relative performance varies with specific problem characteristics (such as problem difficulty). Moreover, the effectiveness of an elaboration can be explained by the degree to which it semantically coheres with the corresponding problem. In particular, elaborations that are most faithful to the original problem description can boost accuracy by up to 24%.
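The pipeline the abstract describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the function names (`elaborate_then_answer`), the prompt format, and the stub generator and answerer below are all hypothetical stand-ins for the real GPT-2 generation and scoring steps.

```python
def elaborate_then_answer(problem, question, generate, answer):
    """Dynamic problem elaboration, sketched:
    1. the model "thinks aloud", generating a free-text elaboration
       of the problem description;
    2. the elaboration is appended to the task context;
    3. the answer is predicted zero-shot on the expanded context."""
    elaboration = generate(problem)            # step 1: model-generated elaboration
    expanded = problem + " " + elaboration     # step 2: expand the context
    return answer(expanded, question)          # step 3: zero-shot prediction


# Stub components standing in for GPT-2 (hypothetical, for illustration only):
stub_generate = lambda ctx: "So Tweety, being a bird, can fly."
stub_answer = lambda ctx, q: "True" if "fly" in ctx else "False"

result = elaborate_then_answer(
    "All birds fly. Tweety is a bird.", "Does Tweety fly?",
    stub_generate, stub_answer)
print(result)
```

In the paper's setting, `generate` and `answer` would both be calls to the same pre-trained GPT-2, so the model conditions its final prediction on its own elaboration.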
