Exploring Sparse Expert Models and Beyond
Chang Zhou | Hongxia Yang | Jingren Zhou | Junyang Lin | Rui Men | Le Jiang | Yong Li | Jie Zhang | Wei Lin | An Yang | Jiamang Wang | Di Zhang | Lin Qu | Xianyan Jia | Ang Wang
[1] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[2] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[3] Naman Goyal, et al. BASE Layers: Simplifying Training of Large, Sparse Models, 2021, ICML.
[4] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[5] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[6] Olatunji Ruwase, et al. ZeRO-Offload: Democratizing Billion-Scale Model Training, 2021, USENIX ATC.
[7] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.
[8] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[9] Zhilin Yang, et al. FastMoE: A Fast Mixture-of-Expert Training System, 2021, ArXiv.
[10] Jie Zhang, et al. Whale: A Unified Distributed Training Framework, 2020, ArXiv.
[11] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.
[12] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[13] Quoc V. Le, et al. Diversity and Depth in Per-Example Routing Models, 2018, ICLR.
[14] Furu Wei, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations, 2019, ICLR.
[15] Ilya Sutskever, et al. Zero-Shot Text-to-Image Generation, 2021, ICML.
[16] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[17] Xianyan Jia, et al. M6: A Chinese Multimodal Pretrainer, 2021, ArXiv.
[18] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.
[19] Olatunji Ruwase, et al. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning, 2021, SC.
[20] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[21] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.
[22] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[23] Lei Zhang, et al. VinVL: Making Visual Representations Matter in Vision-Language Models, 2021, ArXiv.
[24] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.
[25] Can Gao, et al. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, 2021, ACL/IJCNLP.
[26] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[27] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[28] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[29] Olatunji Ruwase, et al. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models, 2019, SC.
[30] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[31] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[32] Marcus Rohrbach, et al. 12-in-1: Multi-Task Vision and Language Representation Learning, 2020, CVPR.
[33] Xiaodong Liu, et al. Unified Language Model Pre-training for Natural Language Understanding and Generation, 2019, NeurIPS.
[34] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[35] Hao Tian, et al. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph, 2020, AAAI.
[36] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[37] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[38] Yu Cheng, et al. Large-Scale Adversarial Training for Vision-and-Language Representation Learning, 2020, NeurIPS.