Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models