BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

In this paper, we propose a novel model compression approach that effectively compresses BERT by progressive module replacing. Our approach first divides the original BERT into several modules and builds their compact substitutes. Then, we randomly replace the original modules with their substitutes so that the compact modules learn to mimic the behavior of the original ones. We progressively increase the probability of replacement throughout training. In this way, our approach enables a deeper level of interaction between the original and compact models. In contrast to previous knowledge distillation approaches for BERT compression, our approach does not introduce any additional loss function. It outperforms existing knowledge distillation approaches on the GLUE benchmark, opening a new perspective on model compression.
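
To make the mechanism concrete, below is a minimal PyTorch sketch of progressive module replacing under stated assumptions: the class name `TheseusEncoder`, the module granularity (one compact substitute per original module), and the linear replacement-probability schedule are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class TheseusEncoder(nn.Module):
    """Sketch of progressive module replacing (hypothetical names).

    Each original module (frozen) is paired with a compact substitute. During
    training, every position independently uses the substitute with probability
    `replace_prob` and the original otherwise; the probability is raised over
    training. Only the downstream task loss is used -- no extra distillation loss.
    """

    def __init__(self, original_modules, compact_modules, start_prob=0.5):
        super().__init__()
        assert len(original_modules) == len(compact_modules)
        self.originals = nn.ModuleList(original_modules)
        self.compacts = nn.ModuleList(compact_modules)
        self.replace_prob = start_prob
        # Freeze the original modules; only the compact substitutes are trained.
        for p in self.originals.parameters():
            p.requires_grad = False

    def set_replace_prob(self, prob):
        self.replace_prob = min(max(prob, 0.0), 1.0)

    def forward(self, hidden_states):
        for orig, compact in zip(self.originals, self.compacts):
            if not self.training:
                # Inference: use only the compact modules.
                hidden_states = compact(hidden_states)
            elif torch.rand(1).item() < self.replace_prob:
                hidden_states = compact(hidden_states)  # substitute replaces original
            else:
                hidden_states = orig(hidden_states)     # keep original (frozen)
        return hidden_states


# Toy usage: six original "modules" compressed into six smaller ones, with a
# linearly increasing replacement probability (one possible schedule).
if __name__ == "__main__":
    dim = 768
    originals = [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
                 for _ in range(6)]
    compacts = [nn.Linear(dim, dim) for _ in range(6)]
    model = TheseusEncoder(originals, compacts, start_prob=0.5)

    total_steps = 1000
    for step in range(total_steps):
        model.set_replace_prob(0.5 + 0.5 * step / total_steps)  # 0.5 -> 1.0
        x = torch.randn(8, dim)
        out = model(x)  # feed into a task head and optimize the task loss here
```

Because each forward pass mixes original and compact modules at random, the substitutes are trained in-place with gradients flowing only through their own parameters, which is why no auxiliary distillation objective is needed.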
