Dynamic Knowledge Distillation for Pre-trained Language Models
Xu Sun | Lei Li | Peng Li | Jie Zhou | Yankai Lin | Shuhuai Ren