BERT-of-Theseus: Compressing BERT by Progressive Module Replacing
Canwen Xu | Wangchunshu Zhou | Tao Ge | Furu Wei | Ming Zhou
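The title's core idea, progressive module replacing, can be illustrated in a few lines: during fine-tuning, each block of the large predecessor model is stochastically swapped for a compact successor block, and the replacement probability is raised toward 1 so that only the successor remains at the end. The following is a minimal, hypothetical PyTorch-style sketch of that idea; the names `TheseusEncoder` and `linear_replacement_schedule` are illustrative and do not come from the authors' released code.

```python
# Minimal sketch of progressive module replacing (illustrative, not the authors' code).
import torch
import torch.nn as nn


class TheseusEncoder(nn.Module):
    """Encoder whose predecessor blocks are progressively replaced by successor blocks."""

    def __init__(self, predecessor_blocks, successor_blocks, start_p=0.5):
        super().__init__()
        assert len(predecessor_blocks) % len(successor_blocks) == 0
        self.predecessors = nn.ModuleList(predecessor_blocks)  # large model's layers (typically frozen)
        self.successors = nn.ModuleList(successor_blocks)      # fewer, trainable layers
        self.group = len(predecessor_blocks) // len(successor_blocks)
        self.replace_p = start_p                                # raised toward 1.0 by a schedule

    def forward(self, hidden):
        for i, successor in enumerate(self.successors):
            if self.training and torch.rand(1).item() > self.replace_p:
                # keep the original predecessor group at this position
                for j in range(self.group):
                    hidden = self.predecessors[i * self.group + j](hidden)
            else:
                # use the compact successor block (always used once training is done)
                hidden = successor(hidden)
        return hidden


def linear_replacement_schedule(step, total_steps, start_p=0.5):
    """Linearly increase the replacement probability from start_p to 1.0."""
    return min(1.0, start_p + (1.0 - start_p) * step / total_steps)


if __name__ == "__main__":
    # Toy demonstration with feed-forward blocks standing in for Transformer layers.
    dim = 16
    predecessors = [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(6)]
    successors = [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(3)]
    model = TheseusEncoder(predecessors, successors)

    x = torch.randn(4, dim)
    for step in range(10):
        model.replace_p = linear_replacement_schedule(step, total_steps=10)
        _ = model(x)   # mixed predecessor/successor forward passes during training
    model.eval()
    _ = model(x)       # in eval mode, only the successor blocks are used
```

At inference time the mixed model is discarded and only the successor blocks are kept, so the deployed model is strictly smaller than the predecessor.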
[1] Ke Xu, et al. BERT Loses Patience: Fast and Robust Inference with Early Exit, 2020, NeurIPS.
[2] Omer Levy, et al. Are Sixteen Heads Really Better than One?, 2019, NeurIPS.
[3] Di He, et al. Multilingual Neural Machine Translation with Knowledge Distillation, 2019, ICLR.
[4] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.
[5] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[6] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.
[7] Douwe Kiela, et al. SentEval: An Evaluation Toolkit for Universal Sentence Representations, 2018, LREC.
[8] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.
[9] Xiaodong Liu, et al. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding, 2019, ArXiv.
[10] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.
[11] Samuel R. Bowman, et al. Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?, 2020, ACL.
[12] Jimmy J. Lin, et al. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, 2019, ArXiv.
[13] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Hector J. Levesque, et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
[15] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.
[16] Ke Xu, et al. Scheduled DropHead: A Regularization Method for Transformer Models, 2020, EMNLP.
[17] Boaz Barak, et al. Deep Double Descent: Where Bigger Models and More Data Hurt, 2019, ICLR.
[18] Xu Tan, et al. MASS: Masked Sequence to Sequence Pre-training for Language Generation, 2019, ICML.
[19] Edouard Grave, et al. Reducing Transformer Depth on Demand with Structured Dropout, 2019, ICLR.
[20] Yoshua Bengio, et al. FitNets: Hints for Thin Deep Nets, 2014, ICLR.
[21] Chris Brockett, et al. Automatically Constructing a Corpus of Sentential Paraphrases, 2005, IJCNLP.
[22] Yu Cheng, et al. Patient Knowledge Distillation for BERT Model Compression, 2019, EMNLP.
[23] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[24] Qun Liu, et al. TinyBERT: Distilling BERT for Natural Language Understanding, 2020, EMNLP.
[25] Ondrej Bojar, et al. Training Tips for the Transformer Model, 2018, Prague Bull. Math. Linguistics.
[26] Yiming Yang, et al. MobileBERT: Task-Agnostic Compression of BERT by Progressive Knowledge Transfer, 2019.
[27] René Vidal, et al. Curriculum Dropout, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[28] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[29] Kurt Keutzer, et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT, 2020, AAAI.
[30] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[31] Ming-Wei Chang, et al. Well-Read Students Learn Better: On the Importance of Pre-training Compact Models, 2019.
[32] Zachary Chase Lipton, et al. Born Again Neural Networks, 2018, ICML.
[33] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[34] Misha Denil, et al. Predicting Parameters in Deep Learning, 2014.
[35] Xiangyu Zhang, et al. Channel Pruning for Accelerating Very Deep Neural Networks, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[36] Xiangyu Zhang, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[37] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.
[38] Samuel R. Bowman, et al. Neural Network Acceptability Judgments, 2018, Transactions of the Association for Computational Linguistics.
[39] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[40] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[41] Ming Yang, et al. Compressing Deep Convolutional Networks using Vector Quantization, 2014, ArXiv.
[42] Peter Bailis, et al. LIT: Learned Intermediate Representation Training for Model Compression, 2019, ICML.
[43] Xiaodong Liu, et al. Unified Language Model Pre-training for Natural Language Understanding and Generation, 2019, NeurIPS.
[44] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[45] Zhongfei Zhang, et al. Doubly Convolutional Neural Networks, 2016, NIPS.
[46] Yiming Yang, et al. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, ACL.
[47] Jason Weston, et al. Curriculum Learning, 2009, ICML '09.
[48] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.