A Short Study on Compressing Decoder-Based Language Models

Pre-trained Language Models (PLMs) have been successful for a wide range of natural language processing (NLP) tasks. State-of-the-art PLMs, however, are too large to be deployed on edge devices. As a result, the topic of model compression has attracted increasing attention in the NLP community. Most existing work focuses on compressing encoder-based models (e.g., TinyBERT, DistilBERT, DistilRoBERTa); to the best of our knowledge, however, the compression of decoder-based models (such as GPT-2) has not been investigated much. Our paper aims to fill this gap. Specifically, we explore two directions: 1) we employ current state-of-the-art knowledge distillation techniques to improve the fine-tuning of DistilGPT-2, and 2) we pre-train a compressed GPT-2 model using layer truncation and compare it against the distillation-based method (DistilGPT-2). The training time of our compressed model is significantly shorter than that of DistilGPT-2, yet it achieves better performance when fine-tuned on downstream tasks. We also demonstrate the impact of data cleaning on model performance.
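
To make the first direction concrete, the snippet below is a minimal sketch of a vanilla Hinton-style knowledge distillation loss of the kind used when fine-tuning a student (e.g., DistilGPT-2) against a larger teacher. The temperature and mixing weight are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: standard soft-label knowledge distillation loss.
# The temperature (2.0) and alpha (0.5) are placeholder values.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend the hard-label task loss with a KL term matching the teacher's softened distribution."""
    ce = F.cross_entropy(student_logits, labels)          # hard-label task loss
    soft = F.kl_div(                                      # soft-label distillation term
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * soft
```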

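For the second direction, the sketch below shows one way to build a layer-truncated GPT-2 with the Hugging Face Transformers library. Keeping the first six blocks and initializing them from the full model are assumptions made for illustration; the paper's exact truncation depth and initialization scheme may differ.

```python
# Hedged sketch: construct a smaller GPT-2 by keeping only the first
# `num_layers` transformer blocks. The choice of 6 layers and the
# copy-from-teacher initialization are illustrative assumptions.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

def truncate_gpt2(num_layers: int = 6) -> GPT2LMHeadModel:
    """Return a GPT-2 LM whose transformer stack is cut down to `num_layers` blocks."""
    teacher = GPT2LMHeadModel.from_pretrained("gpt2")  # 12-layer base model
    student_config = GPT2Config.from_pretrained("gpt2", n_layer=num_layers)
    student = GPT2LMHeadModel(student_config)

    # Copy embeddings, the first `num_layers` blocks, and the final LayerNorm from the teacher.
    student.transformer.wte.load_state_dict(teacher.transformer.wte.state_dict())
    student.transformer.wpe.load_state_dict(teacher.transformer.wpe.state_dict())
    for i in range(num_layers):
        student.transformer.h[i].load_state_dict(teacher.transformer.h[i].state_dict())
    student.transformer.ln_f.load_state_dict(teacher.transformer.ln_f.state_dict())
    return student

student = truncate_gpt2(6)
print(sum(p.numel() for p in student.parameters()) / 1e6, "M parameters")
```

The truncated model can then be further pre-trained and fine-tuned with the usual causal language modeling objective, optionally combined with the distillation loss sketched above.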