PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Myle Ott, Alban Desmaison, Liangchen Luo, Geeta Chauhan, R. Varma, Hamid Shojanazeri, Can Balioglu, Min Xu, Y. Hao, Sam Shleifer, Chien-chin Huang, A. Gu, Shen Li, Yanli Zhao, Less Wright, Bernard Nguyen