Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models

We benchmark knowledge distillation (KD) from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments cover 12 Indonesian-language datasets grouped into two tasks: text classification and sequence labeling. We also compare several aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our results show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources (CPU, RAM, and storage) compared to pruned BERT models. We further propose some quick wins for producing small NLP models via efficient KD training, involving simple choices of loss function, word embeddings, and unlabeled data preparation.
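To make the distillation setup concrete, below is a minimal sketch (in PyTorch, not the paper's exact code) of logit-level KD from a fine-tuned BERT teacher into a BiLSTM student for text classification. The names (BiLSTMStudent, distillation_loss) and hyperparameters are illustrative assumptions; the loss combines hard-label cross-entropy with a soft term on the teacher's logits (here MSE), one of the simple loss choices the abstract refers to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMStudent(nn.Module):
    """Small BiLSTM text classifier used as the distillation student."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_labels=2):
        super().__init__()
        # The embedding can be initialized from pre-trained word vectors
        # (e.g., fastText or word2vec), one of the aspects compared here.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)                  # (B, T, E)
        _, (hidden, _) = self.encoder(embedded)               # hidden: (2, B, H)
        # Concatenate the final forward and backward hidden states.
        sentence_repr = torch.cat([hidden[-2], hidden[-1]], dim=-1)  # (B, 2H)
        return self.classifier(sentence_repr)                 # logits: (B, C)


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and MSE on the teacher's logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.mse_loss(student_logits, teacher_logits)
    return alpha * hard + (1.0 - alpha) * soft
```

On unlabeled augmentation data, where no gold label exists, the hard-label term can be dropped (or replaced with the teacher's predicted label) so that only the soft-target term drives the student.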
