Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models

Fine-tuning large models is highly effective; however, inference can be expensive and produce carbon emissions. Knowledge distillation has been shown to be a practical solution for reducing inference costs, but the distillation process itself requires significant computational resources. Rather than buying or renting GPUs to fine-tune and then distill a large model, an NLP practitioner might instead choose to allocate the available budget to hiring annotators and manually labeling additional fine-tuning data. In this paper, we investigate how to most efficiently use a fixed budget to build a compact model. Through extensive experiments on six diverse tasks, we show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data to directly train a compact model (T5-Small). We further investigate how the optimal share of the budget allocated to computation varies across scenarios. We will release our code, datasets, annotation cost estimates, and baseline models as a benchmark to support further work on cost-efficient training of compact models.
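
To make the trade-off concrete, below is a minimal budget-accounting sketch, assuming purely illustrative costs: the per-label price, GPU rental rate, GPU-hour figures, and the helper function name are hypothetical placeholders, not the paper's actual cost estimates. The point it illustrates is that, under a fixed budget, every GPU-hour spent fine-tuning and distilling the 11B teacher is money that cannot be spent on additional human labels for the compact model.

# Minimal budget-accounting sketch (Python). All cost figures are illustrative
# assumptions, NOT the paper's actual annotation or compute cost estimates.

def labels_affordable(budget_usd: float, cost_per_label_usd: float,
                      gpu_hours: float, gpu_rate_usd_per_hour: float) -> int:
    """Number of human-labeled examples affordable after paying for compute."""
    remaining = budget_usd - gpu_hours * gpu_rate_usd_per_hour
    return max(0, int(remaining // cost_per_label_usd))

BUDGET = 1_000.00        # total fixed budget in USD (assumed)
COST_PER_LABEL = 0.10    # price paid per annotated example (assumed)
GPU_RATE = 3.00          # cloud rental price per GPU-hour (assumed)

# Strategy A: spend nearly everything on annotation; fine-tune T5-Small directly.
ann_only = labels_affordable(BUDGET, COST_PER_LABEL,
                             gpu_hours=5, gpu_rate_usd_per_hour=GPU_RATE)

# Strategy B: reserve compute to fine-tune T5-XXL and distill it into T5-Small,
# leaving less of the budget for human labels.
distill = labels_affordable(BUDGET, COST_PER_LABEL,
                            gpu_hours=150, gpu_rate_usd_per_hour=GPU_RATE)

print(f"Annotate-only: {ann_only} labels for T5-Small")
print(f"Distillation:  {distill} labels, plus a T5-XXL teacher for pseudo-labels")

Under any such accounting, the question studied here reduces to whether the teacher's pseudo-labels on unlabeled text recover more student accuracy than the forgone human labels would.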
