Distilling Multi-Step Reasoning Capabilities of Large Language Models into Smaller Models via Semantic Decompositions

Step-by-step reasoning approaches like chain-of-thought (CoT) have proved very effective at inducing reasoning capabilities in large language models. However, the success of CoT depends primarily on model size, and billion-parameter-scale models are often needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these reasoning abilities into smaller models. Our approach, DECOMPOSITIONAL DISTILLATION, learns a semantic decomposition of the original problem into a sequence of subproblems and uses it to train two models: (a) a problem decomposer that learns to decompose a complex reasoning problem into a sequence of simpler subproblems, and (b) a problem solver that uses these intermediate subproblems to solve the overall problem. On a multi-step math word problem dataset (GSM8K), our approach boosts the performance of GPT-2 variants by up to 35% compared to CoT distillation. We show that with our approach it is possible to train a GPT-2-large model (775M parameters) that outperforms a 10X larger GPT-3 (6B) model trained using CoT reasoning. Finally, we demonstrate that our problem decomposition can also serve as an alternative to CoT prompting, boosting GPT-3 performance by 40% compared to CoT prompts.
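
To make the two-model pipeline concrete, here is a minimal inference sketch in Python using HuggingFace transformers. The checkpoints, prompt formats, `[END]` stop marker, and greedy decoding settings are all assumptions for illustration; the abstract does not specify the actual fine-tuning formats.

```python
# Hedged sketch of the decomposer/solver pipeline described above.
# Assumptions: both models are GPT-2 checkpoints fine-tuned separately
# (one to emit subquestions, one to answer them), and the decomposer
# signals completion with a hypothetical "[END]" marker.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
decomposer = GPT2LMHeadModel.from_pretrained("gpt2-large")  # assumed fine-tuned decomposer
solver = GPT2LMHeadModel.from_pretrained("gpt2-large")      # assumed fine-tuned solver

def generate(model, prompt, max_new_tokens=64):
    """Greedy decoding helper; the paper's decoding settings are unspecified."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         pad_token_id=tokenizer.eos_token_id)
    # Return only the newly generated continuation.
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

def solve(problem, max_steps=5):
    context = problem
    for _ in range(max_steps):
        # (a) Decomposer proposes the next subquestion given the problem
        #     and all previously answered subquestions.
        subq = generate(decomposer, context + "\nNext subquestion:")
        if "[END]" in subq:  # hypothetical stop marker
            break
        # (b) Solver answers the subquestion in context.
        suba = generate(solver, context + "\nQ: " + subq + "\nA:")
        context += f"\nQ: {subq}\nA: {suba}"
    # The final answer is the solver's answer to the last subquestion.
    return context

print(solve("Tom has 3 boxes with 4 apples each. He gives away 5. How many remain?"))
```

The design point the sketch illustrates is that decomposition and solving are trained as separate models, so the solver only ever has to answer simple subquestions rather than perform multi-step reasoning in one shot.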
