BaGuaLu: targeting brain scale pretrained models with over 37 million cores
Jie Tang, Wenguang Chen, Hongxia Yang, J. Qiu, Junyang Lin, Zhenbo Sun, Zixuan Ma, Jianwei Zhang, Jidong Zhai, Shizhi Tang, Haojie Wang, Liyan Zheng, Huan Cao, Shangkun Liu, Guanyu Feng, Jiaao He, Aohan Zeng, Xin Liu, Tianyu Zheng, Weimin Zheng, Jie Gao, Zeqiang Huang, Runxin Zhong, Tianhui Shi, Yuanwei Wang