Contrastive Distillation on Intermediate Representations for Language Model Compression

Existing language model compression methods mostly use a simple L2 loss to distill knowledge from the intermediate representations of a large BERT model into a smaller one. Although widely used, this objective implicitly assumes that all dimensions of the hidden representations are independent, and thus fails to capture important structural knowledge in the teacher network's intermediate layers. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework in which the student is trained to distill knowledge from the teacher's intermediate layers via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both the pre-training and fine-tuning stages, and achieves strong performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
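To make the idea concrete, below is a minimal, illustrative sketch (not the authors' released code) of a contrastive loss over intermediate representations, assuming mean-pooled hidden states and an InfoNCE-style objective whose negatives are teacher representations of other training examples. All class names, shapes, and the temperature value are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveIntermediateLoss(nn.Module):
    """Sketch of a contrastive distillation term on intermediate layers."""

    def __init__(self, student_dim: int, teacher_dim: int, temperature: float = 0.1):
        super().__init__()
        # Project student states into the teacher's space so they are comparable.
        self.project = nn.Linear(student_dim, teacher_dim)
        self.temperature = temperature

    def forward(self, student_hidden, teacher_hidden, negative_bank):
        # student_hidden: (B, student_dim)  pooled intermediate states of the student
        # teacher_hidden: (B, teacher_dim)  pooled teacher states of the same inputs (positives)
        # negative_bank:  (K, teacher_dim)  teacher states of other examples (negatives)
        z_s = F.normalize(self.project(student_hidden), dim=-1)   # (B, D)
        z_t = F.normalize(teacher_hidden, dim=-1)                 # (B, D)
        z_n = F.normalize(negative_bank, dim=-1)                  # (K, D)

        pos = (z_s * z_t).sum(dim=-1, keepdim=True)               # (B, 1) similarity to positive
        neg = z_s @ z_n.t()                                       # (B, K) similarities to negatives

        logits = torch.cat([pos, neg], dim=1) / self.temperature  # (B, 1 + K)
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        # The positive sits at index 0, so this is a (1 + K)-way classification loss.
        return F.cross_entropy(logits, labels)
```

In the full method, representations are pooled across multiple intermediate layers and a large negative set is maintained; this sketch only shows the shape of the contrastive term that replaces the per-dimension L2 matching loss.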
