Improving Task-Agnostic BERT Distillation with Layer Mapping Search

Knowledge distillation (KD), which transfers knowledge from a large teacher model to a small student model, has recently been widely used to compress BERT. Besides the output-level supervision in the original KD, recent works show that layer-level supervision is crucial to the performance of the student BERT model. However, previous works design the layer mapping strategy heuristically (e.g., uniform or last-layer), which can lead to inferior performance. In this paper, we propose to use a genetic algorithm (GA) to search for the optimal layer mapping automatically. To accelerate the search, we further propose a proxy setting in which a small portion of the training corpus is sampled for distillation and three representative tasks are chosen for evaluation. After obtaining the optimal layer mapping, we perform task-agnostic BERT distillation with it on the whole corpus to build a compact student model, which can be directly fine-tuned on downstream tasks. Comprehensive experiments on the evaluation benchmarks demonstrate that 1) the layer mapping strategy has a significant effect on task-agnostic BERT distillation, and different layer mappings can result in quite different performance; 2) the optimal layer mapping found by the proposed search process consistently outperforms the heuristic ones; 3) with the optimal layer mapping, our student model achieves state-of-the-art performance on the GLUE tasks.
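As a rough illustration of how such a search could be organized, the sketch below implements a plain genetic algorithm over layer mappings, where each student layer is assigned one teacher layer to mimic (or none). The student/teacher sizes, GA hyperparameters, and the stand-in proxy_fitness function are assumptions for illustration only; in the paper's proxy setting the fitness would instead be the score of a student distilled with the candidate mapping on the sampled corpus and evaluated on the three representative tasks.

```python
import random

# Hypothetical proxy setting: a 6-layer student distilled from a 12-layer teacher.
# All hyperparameters below (population size, generations, mutation rate) are
# illustrative assumptions, not values from the paper.
NUM_STUDENT_LAYERS = 6
NUM_TEACHER_LAYERS = 12
POP_SIZE = 20
GENERATIONS = 10
MUTATION_RATE = 0.1


def random_mapping():
    # A candidate assigns each student layer a teacher layer to mimic
    # (0 means "no layer-level supervision" for that student layer).
    return [random.randint(0, NUM_TEACHER_LAYERS) for _ in range(NUM_STUDENT_LAYERS)]


def proxy_fitness(mapping):
    # Stand-in objective so the sketch runs end to end: it rewards mappings that
    # are increasing and diverse. In the paper's proxy setting this would instead
    # be the averaged score of a student distilled with `mapping` on the sampled
    # corpus and evaluated on the three representative tasks.
    monotone = sum(1 for a, b in zip(mapping, mapping[1:]) if b > a)
    return monotone + 0.1 * len(set(mapping))


def crossover(parent_a, parent_b):
    # Single-point crossover between two parent mappings.
    point = random.randint(1, NUM_STUDENT_LAYERS - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(mapping):
    # Re-sample each gene (teacher-layer choice) with a small probability.
    return [random.randint(0, NUM_TEACHER_LAYERS) if random.random() < MUTATION_RATE else g
            for g in mapping]


def search():
    population = [random_mapping() for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=proxy_fitness, reverse=True)
        parents = ranked[: POP_SIZE // 2]  # truncation selection of the fitter half
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(POP_SIZE - len(parents))
        ]
        population = parents + children
    return max(population, key=proxy_fitness)


if __name__ == "__main__":
    best = search()
    print("best layer mapping (student layer i -> teacher layer):", best)
```

In this sketch the fittest half of each generation survives unchanged and the rest is regenerated by crossover and mutation; swapping in the actual distill-and-evaluate fitness is the only change needed to reproduce the search loop described above.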
