One Student Knows All Experts Know: From Sparse to Dense
[1] Zhiyuan Liu, et al. Knowledge Inheritance for Pre-trained Language Models, 2021, ArXiv.
[2] Carlos Riquelme, et al. Scaling Vision with Sparse Mixture of Experts, 2021, NeurIPS.
[3] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.
[4] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[5] Ammar Ahmad Awan, et al. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, 2022.
[6] Andrew M. Dai, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2021, ArXiv.
[7] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[8] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[9] Zhiyuan Liu, et al. CPM-2: Large-scale Cost-effective Pre-trained Language Models, 2021, AI Open.
[10] Quoc V. Le, et al. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space, 2019, CVPRW.
[11] Matthieu Cord, et al. Training Data-efficient Image Transformers & Distillation through Attention, 2020, ICML.
[12] Maosong Sun, et al. MoEfication: Conditional Computation of Transformer Models for Efficient Inference, 2021, ArXiv.
[13] Yu Sun, et al. ERNIE: Enhanced Representation through Knowledge Integration, 2019, ArXiv.
[14] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.
[15] Percy Liang, et al. Know What You Don't Know: Unanswerable Questions for SQuAD, 2018, ACL.
[16] Zangwei Zheng, et al. Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, 2021, ArXiv.
[17] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[18] Jiashi Feng, et al. Revisit Knowledge Distillation: A Teacher-free Framework, 2019, ArXiv.
[19] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[20] Zornitsa Kozareva, et al. Efficient Large Scale Language Modeling with Mixtures of Experts, 2021, ArXiv.
[21] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, CVPR.
[22] Thomas Wolf, et al. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter, 2019, ArXiv.
[23] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[24] Ann L. Brown, et al. How People Learn: Brain, Mind, Experience, and School, 1999.
[25] Yang You, et al. Go Wider Instead of Deeper, 2021, AAAI.
[26] Hongyi Zhang, et al. mixup: Beyond Empirical Risk Minimization, 2017, ICLR.
[27] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[28] Li Fei-Fei, et al. ImageNet: A Large-Scale Hierarchical Image Database, 2009, CVPR.
[29] Pat Langley, et al. Crafting Papers on Machine Learning, 2000, ICML.
[30] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes, 2019, ICLR.
[31] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[32] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.
[33] Naman Goyal, et al. BASE Layers: Simplifying Training of Large, Sparse Models, 2021, ICML.
[34] Patrick H. Chen, et al. DRONE: Data-aware Low-rank Compression for Large NLP Models, 2021, NeurIPS.