Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation

Knowledge distillation is one of the primary methods for transferring knowledge from large models to small ones. However, it requires massive amounts of task-specific data, which may not be available in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, or augmentation with models have been applied to tackle this problem. However, these methods either potentially cause shifts in decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (augmentation with models). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on representation-interpolation augmentation to maintain the diversity of expressions, and converts the augmented data back to tokens to avoid shifting decision boundaries. It uses simple operations that incur little computational overhead. Results on multiple GLUE tasks show that our method improves distillation performance by a large margin at a low time cost. Code is available at https://github.com/google-research/google-research/tree/master/augpro.
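To make the "interpolate, then project back to tokens" idea concrete, the sketch below interpolates the embeddings of two token sequences (mixup-style) and maps each interpolated vector to its nearest vocabulary token. This is a minimal illustration under assumed details (a single embedding table, Euclidean nearest-neighbor projection); the function and variable names are hypothetical and not the released AugPro implementation.

```python
import numpy as np

def augment_with_projection(ids_a, ids_b, embedding_table, lam=0.7):
    """Interpolate the embeddings of two token sequences, then project each
    interpolated vector back to the closest token in the vocabulary."""
    emb_a = embedding_table[ids_a]             # (seq_len, dim)
    emb_b = embedding_table[ids_b]             # (seq_len, dim)
    mixed = lam * emb_a + (1.0 - lam) * emb_b  # mixup-style interpolation

    # Project every mixed vector to its nearest vocabulary embedding
    # (squared Euclidean distance against the whole table).
    dists = ((mixed[:, None, :] - embedding_table[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)                # augmented token ids

# Toy usage: a 10-token vocabulary with 4-dimensional embeddings.
rng = np.random.default_rng(0)
table = rng.normal(size=(10, 4))
seq_a = np.array([1, 3, 5, 7])
seq_b = np.array([2, 3, 6, 9])
print(augment_with_projection(seq_a, seq_b, table))
```

Because the projection step is a nearest-neighbor search over the vocabulary, it can be batched or approximated with a kNN index on accelerators, which keeps the augmentation overhead low compared with model-based augmentation.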
