Shen Li | Mahdi Soltanolkotabi | Chaoyang He | Salman Avestimehr
[1] Alexander J. Smola, et al. Scaling Distributed Machine Learning with the Parameter Server, 2014, OSDI.
[2] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, arXiv.
[3] Jascha Sohl-Dickstein, et al. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability, 2017, NIPS.
[4] Samy Bengio, et al. Insights on representational similarity in neural networks with canonical correlation, 2018, NeurIPS.
[5] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.
[6] Liwei Wang, et al. On Layer Normalization in the Transformer Architecture, 2020, ICML.
[7] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.
[8] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[10] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[11] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[12] Olatunji Ruwase, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2019, SC.
[13] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[14] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[15] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, arXiv.
[16] Shen Li, et al. PyTorch Distributed: Experiences on Accelerating Data Parallel Training, 2020, Proc. VLDB Endow.
[17] Yibo Zhu, et al. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters, 2020, OSDI.
[18] Byung-Gon Chun, et al. Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks, 2018, EuroSys.
[19] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.
[20] Jaesik Choi, et al. HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism, 2020, USENIX ATC.