Small Language Models Improve Giants by Rewriting Their Outputs

Large language models (LLMs) demonstrate impressive few-shot learning capabilities, but they often underperform fine-tuned models on challenging tasks. Furthermore, their large size and the fact that they are typically accessible only through APIs make task-specific fine-tuning impractical. Moreover, LLMs are sensitive to different aspects of the prompt (e.g., the selection and order of demonstrations), so they can require time-consuming prompt engineering. In this light, we propose a method to correct LLM outputs without access to their weights. First, we generate a pool of candidates by few-shot prompting an LLM. Second, we refine the LLM-generated outputs using a smaller model, the LM-corrector (LMCor), which is trained to rank, combine, and rewrite the candidates to produce the final target output. Our experiments demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B) across diverse tasks. We also show that the LMCor is robust to different prompts, minimizing the need for extensive prompt engineering. Finally, the LMCor can be seamlessly integrated with different LLMs at inference time, serving as a plug-and-play module that improves their performance.
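The two-stage pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation: the `t5-base` checkpoint (a ~220M seq2seq model standing in for the 250M LMCor), the `llm_generate` stub, and the "source + concatenated candidates" input format are all assumptions made for the sake of the example.

```python
# Sketch of the generate-then-correct pipeline: (1) sample candidates from a
# few-shot-prompted LLM, (2) have a small corrector rank/combine/rewrite them.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder corrector checkpoint; the paper's trained LMCor weights are assumed unavailable.
CORRECTOR_NAME = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(CORRECTOR_NAME)
corrector = AutoModelForSeq2SeqLM.from_pretrained(CORRECTOR_NAME)


def llm_generate(prompt: str, num_candidates: int = 5) -> list[str]:
    """Stub for few-shot prompting a large LM (e.g., via an API) and sampling candidates."""
    raise NotImplementedError("Call your LLM API here and return num_candidates samples.")


def correct(source: str, few_shot_prompt: str, num_candidates: int = 5) -> str:
    # Step 1: generate a pool of candidate outputs from the few-shot-prompted LLM.
    candidates = llm_generate(few_shot_prompt + source, num_candidates)

    # Step 2: feed the source and all candidates to the small corrector, which
    # produces the final output (assumed input format; the real one may differ).
    corrector_input = source + " candidates: " + " | ".join(candidates)
    inputs = tokenizer(corrector_input, return_tensors="pt", truncation=True)
    output_ids = corrector.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because the corrector only consumes text, the same `correct` function can wrap any candidate-producing LLM, which is what allows the LMCor to act as a plug-and-play module at inference time.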
