Yang You | Yong Liu | Fuzhao Xue | Ziji Shi | Futao Wei | Yuxuan Lou
[1] Chang Zhou, et al. Exploring Sparse Expert Models and Beyond, 2021, arXiv.
[2] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[3] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[4] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, arXiv.
[5] W. Bruce Croft, et al. BERT with History Answer Embedding for Conversational Question Answering, 2019, SIGIR.
[6] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, CVPR.
[7] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, arXiv.
[8] Percy Liang, et al. Know What You Don't Know: Unanswerable Questions for SQuAD, 2018, ACL.
[9] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[10] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[11] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[12] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[13] Yoshua Bengio, et al. Deep Learning of Representations: Looking Forward, 2013, SLSP.
[14] Hongyi Zhang, et al. mixup: Beyond Empirical Risk Minimization, 2017, ICLR.
[15] Chenhui Chu, et al. BERT Representations for Video Question Answering, 2020, WACV.
[16] Quoc V. Le, et al. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space, 2020, CVPRW.
[17] Carlos Riquelme, et al. Scaling Vision with Sparse Mixture of Experts, 2021, NeurIPS.
[18] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, arXiv.
[19] Hao Zhang, et al. GDPNet: Refining Latent Multi-View Graph for Relation Extraction, 2020, AAAI.
[20] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes, 2019, ICLR.
[21] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[22] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, ICCV.
[23] James Demmel, et al. Reducing BERT Pre-Training Time from 3 Days to 76 Minutes, 2019, arXiv.
[24] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.
[25] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[26] Li Fei-Fei, et al. ImageNet: A Large-Scale Hierarchical Image Database, 2009, CVPR.
[27] Yang You, et al. Sequence Parallelism: Making 4D Parallelism Possible, 2021, arXiv.
[28] Chen Sun, et al. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era, 2017, ICCV.
[29] Yu Cheng, et al. Patient Knowledge Distillation for BERT Model Compression, 2019, EMNLP.
[30] Jiashi Feng, et al. Revisit Knowledge Distillation: A Teacher-Free Framework, 2019, arXiv.
[31] Tengyu Ma, et al. Document-Level Relation Extraction with Adaptive Thresholding and Localized Context Pooling, 2020, AAAI.
[32] An Embarrassingly Simple Model for Dialogue Relation Extraction, 2020, arXiv.
[33] Furu Wei, et al. BERT-of-Theseus: Compressing BERT by Progressive Module Replacing, 2020, EMNLP.
[34] Lukasz Kaiser, et al. Universal Transformers, 2018, ICLR.
[35] Alexander Kolesnikov, et al. Scaling Vision Transformers, 2021, arXiv.