Fandong Meng | Weiping Wang | Zheng Lin | Jie Zhou | Yuanxin Liu