Structured Pruning of a BERT-based Question Answering Model

The recent trend in industry-oriented Natural Language Processing (NLP) research has been to operate large-scale pretrained language models like BERT under strict computational limits. While most model compression work has focused on "distilling" a general-purpose language representation through expensive pretraining distillation, much less attention has been paid to creating smaller task-specific language representations, which, arguably, are more useful in an industry setting. In this paper, we investigate compressing BERT- and RoBERTa-based question answering systems by structured pruning of parameters from the underlying trained transformer model. We find that an inexpensive combination of task-specific structured pruning and task-specific distillation, without the expense of pretraining distillation, yields high-performing models across a range of speed/accuracy trade-off operating points. We start from full-size models trained for SQuAD 2.0 or Natural Questions and introduce gates that allow selected parts of the transformer to be eliminated individually. Specifically, we investigate (1) structured pruning to reduce the number of parameters in each transformer layer, (2) applicability to both BERT- and RoBERTa-based models, (3) applicability to both SQuAD 2.0 and Natural Questions, and (4) combining structured pruning with distillation. We find that pruning a combination of attention heads and the feed-forward layer nearly doubles inference speed on SQuAD 2.0, at a cost of less than 1.5 F1 points of accuracy. Furthermore, combining distillation with structured pruning almost doubles the inference speed of a RoBERTa-large-based model on Natural Questions, while losing less than 0.5 F1 points on short answers.
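
The gating mechanism lends itself to a small code sketch. The abstract does not specify how the gates are parameterized or trained (common choices in the literature include L0 regularization with a hard-concrete relaxation, or gates set from importance estimates), so the following is a minimal illustration rather than the authors' implementation: it assumes plain multiplicative gates, one per attention head and one per feed-forward intermediate unit, attached to a vanilla transformer layer. The class names and BERT-base default sizes are illustrative assumptions.

```python
# Hedged sketch: multiplicative pruning gates on attention heads and FFN units.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMultiHeadSelfAttention(nn.Module):
    """Self-attention whose per-head output is scaled by a learnable gate.

    A gate driven to zero makes its head removable: the corresponding
    slices of the QKV and output projections can be deleted outright,
    which is what yields a real inference speedup (structured pruning),
    unlike scattered weight-level sparsity.
    """

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out = nn.Linear(hidden_size, hidden_size)
        self.head_gates = nn.Parameter(torch.ones(num_heads))  # start fully open

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, hidden = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = (split(t) for t in self.qkv(x).chunk(3, dim=-1))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        ctx = torch.softmax(scores, dim=-1) @ v        # (batch, heads, seq, head_dim)
        ctx = ctx * self.head_gates.view(1, -1, 1, 1)  # zero gate == head removed
        ctx = ctx.transpose(1, 2).reshape(batch, seq_len, hidden)
        return self.out(ctx)


class GatedFeedForward(nn.Module):
    """Transformer FFN with one gate per intermediate hidden unit."""

    def __init__(self, hidden_size: int = 768, intermediate_size: int = 3072):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.down = nn.Linear(intermediate_size, hidden_size)
        self.ffn_gates = nn.Parameter(torch.ones(intermediate_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Zeroed gates let whole rows of `up` and columns of `down` be dropped.
        return self.down(F.gelu(self.up(x)) * self.ffn_gates)


x = torch.randn(2, 16, 768)  # (batch, seq, hidden)
y = GatedFeedForward()(GatedMultiHeadSelfAttention()(x))
assert y.shape == x.shape
```

Once training drives a gate to zero, the gated head or hidden unit contributes nothing, so its parameters can be physically deleted and the model re-exported at a smaller size; sweeping how many gates survive is what produces the range of speed/accuracy operating points described above.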

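The distillation half of the recipe can be sketched in the same spirit. For extractive QA the model predicts start- and end-position distributions over the passage, so a natural task-specific distillation loss mixes the usual hard-label cross-entropy with a temperature-scaled soft loss against a full-size teacher's start/end logits. The abstract does not give the exact objective; the function below, including its name, the temperature, and the mixing weight, is an assumed standard formulation, not the paper's.

```python
# Hedged sketch: task-specific distillation loss for span-extraction QA.
import torch
import torch.nn.functional as F


def qa_distillation_loss(
    student_start: torch.Tensor,    # (batch, seq_len) student start logits
    student_end: torch.Tensor,      # (batch, seq_len) student end logits
    teacher_start: torch.Tensor,    # teacher logits, same shapes
    teacher_end: torch.Tensor,
    start_positions: torch.Tensor,  # (batch,) gold start indices
    end_positions: torch.Tensor,    # (batch,) gold end indices
    temperature: float = 2.0,       # assumed value; softens both distributions
    alpha: float = 0.5,             # assumed weight on the soft (teacher) term
) -> torch.Tensor:
    def soft_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
        # KL(teacher || student) over span positions, scaled by T^2 as usual.
        s = F.log_softmax(student / temperature, dim=-1)
        t = F.log_softmax(teacher / temperature, dim=-1)
        return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature**2

    soft = 0.5 * (soft_loss(student_start, teacher_start)
                  + soft_loss(student_end, teacher_end))
    hard = 0.5 * (F.cross_entropy(student_start, start_positions)
                  + F.cross_entropy(student_end, end_positions))
    return alpha * soft + (1.0 - alpha) * hard
```

Because this loss only needs the teacher's logits on the task's own training set, the step costs roughly one fine-tuning run, which is far cheaper than the pretraining-corpus distillation used to build general-purpose compact models.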