Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

Transformer-based models pre-trained on large-scale corpora achieve state-of-the-art accuracy on natural language processing tasks, but they are too resource-hungry and compute-intensive for low-capability devices or applications with strict latency requirements. One potential remedy is model compression, which has attracted extensive research attention. This paper summarizes the branches of research on compressing Transformer-based models, focusing on the widely used BERT model. Because BERT has a complex architecture, a compression technique that is highly effective on one part of the model, e.g., the attention layers, may be less successful on another, e.g., the fully connected layers. In this systematic study, we identify the state of the art in compression for each part of BERT, clarify current best practices for compressing large-scale Transformer models, and provide insights into the inner workings of the various methods. Our categorization and analysis also shed light on promising future research directions toward a lightweight, accurate, and generic natural language processing model.
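To make the part-by-part framing concrete, the following is a minimal sketch, not taken from the paper, of inspecting where BERT's parameters live and then applying one simple compression step, post-training dynamic quantization of the linear layers. It assumes the HuggingFace transformers library and PyTorch, and relies on the standard module names used by the bert-base-uncased implementation (embeddings, attention, intermediate/output dense layers).

```python
# Minimal illustration (an assumption, not the paper's own method): group BERT's
# parameters by coarse sub-module to see how much sits in attention vs.
# feed-forward layers, then apply PyTorch dynamic quantization to nn.Linear.
from collections import Counter

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Tally parameter counts by component, using the standard HuggingFace module
# names; LayerNorms, biases outside these groups, and the pooler fall under "other".
buckets = Counter()
for name, param in model.named_parameters():
    if name.startswith("embeddings"):
        buckets["embeddings"] += param.numel()
    elif ".attention." in name:
        buckets["attention"] += param.numel()
    elif ".intermediate." in name or ".output.dense" in name:
        buckets["feed-forward"] += param.numel()
    else:
        buckets["other"] += param.numel()

total = sum(buckets.values())
for part, count in buckets.most_common():
    print(f"{part:>12}: {count / 1e6:6.1f}M params ({100 * count / total:4.1f}%)")

# One simple compression step: 8-bit dynamic quantization of all linear layers
# (weights stored as int8, activations quantized on the fly at inference time).
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```

The split reported by such a tally is only illustrative, but it mirrors the observation above: attention layers and fully connected layers account for different shares of the model, so a compression method tuned to one part need not transfer to the other.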
