Knowledge Distillation of Russian Language Models with Reduction of Vocabulary

Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings and hidden representations. An alternative option is to reduce the number of tokens in the vocabulary, and hence the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models. As a result, KL-based knowledge distillation cannot be applied directly. We propose two simple yet effective alignment techniques that make knowledge distillation possible for students with a reduced vocabulary. Evaluation of the distilled models on a number of common benchmarks for Russian, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrates that our techniques achieve compression from 17× to 49× while maintaining the quality of a 1.7×-compressed student with the full-sized vocabulary but a reduced number of Transformer layers only. We make our code and distilled models available.
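For context on why the vocabulary mismatch blocks direct distillation, the sketch below shows the standard temperature-scaled KL distillation objective (Hinton et al., 2015), which presupposes that the teacher and student produce distributions over the same set of classes. This is an illustrative PyTorch sketch only: the hyperparameter names `temperature` and `alpha` are assumptions, and it does not implement the alignment techniques proposed in the paper.

```python
# Minimal sketch of standard soft-label knowledge distillation (Hinton et al., 2015),
# which assumes the teacher and student logits cover the SAME vocabulary/class set.
# `temperature` and `alpha` are illustrative hyperparameter names, not the paper's.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Temperature-softened distributions; shapes must match, e.g. [batch, num_classes].
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened teacher and student distributions,
    # scaled by T^2 as in the original formulation.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard-label cross-entropy on the student's own predictions.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

When the student's vocabulary is a strict subset of the teacher's, the two logit tensors differ in their last dimension, so the KL term above is undefined without some alignment between the vocabularies, which is exactly the gap the proposed techniques address.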
