Kronecker Decomposition for GPT Compression

GPT is an auto-regressive Transformer-based pre-trained language model that has attracted considerable attention in the natural language processing (NLP) domain due to its state-of-the-art performance on several downstream tasks. The success of GPT is mostly attributed to its pre-training on huge amounts of data and its large number of parameters (from 100M to billions of parameters). Despite the superior performance of GPT (especially in few-shot or zero-shot setups), this overparameterization can make deploying the model prohibitive on devices with limited computational power or memory. The problem can be mitigated with model compression techniques; however, compressing GPT models has not been investigated much in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized from the Kronecker-decomposed version of GPT-2 and then undergoes very light pre-training on only a small portion of the training data with intermediate-layer knowledge distillation (ILKD). Finally, KnGPT2 is fine-tuned on downstream tasks, also using ILKD. We evaluate our model on both language modeling and the General Language Understanding Evaluation (GLUE) benchmark tasks and show that, with more efficient pre-training and a similar number of parameters, KnGPT2 significantly outperforms the existing DistilGPT2 model.
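The core compression step replaces each linear mapping's weight matrix W with a Kronecker product of two much smaller factors, W ≈ A ⊗ B, so the layer only stores and trains A and B. The abstract does not spell out how the factors are obtained, so the sketch below uses the standard nearest-Kronecker-product solution (Van Loan and Pitsianis), which reduces the problem to a rank-1 SVD of a rearranged copy of W; the factor shapes and function names here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def nearest_kronecker_factors(W, m1, n1, m2, n2):
    """Best Frobenius-norm approximation W ≈ kron(A, B).

    W has shape (m1*m2, n1*n2); A is (m1, n1), B is (m2, n2).
    Uses the Van Loan-Pitsianis rearrangement: the problem becomes
    a rank-1 approximation of a reshaped/permuted copy of W.
    """
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange so that R[i*n1 + j, k*n2 + l] == W[i*m2 + k, j*n2 + l].
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    # The top singular pair of R gives vec(A) and vec(B), splitting the scale evenly.
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(S[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(S[0]) * Vt[0, :].reshape(m2, n2)
    return A, B

# Toy example: compress a 768x768 GPT-2-sized projection into two small factors.
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))
A, B = nearest_kronecker_factors(W, m1=32, n1=32, m2=24, n2=24)
W_hat = np.kron(A, B)             # reconstruction used to initialize the compressed layer
params_before = W.size            # 589,824
params_after = A.size + B.size    # 1,600
print(params_before, params_after, np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

In a full model the factors would be wrapped in a layer whose forward pass uses the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ) rather than materializing the full matrix, and the light ILKD pre-training described above then recovers accuracy lost to the approximation.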
