PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

It is widely acknowledged that large models can deliver superior performance across a broad range of domains. Despite remarkable progress in machine learning systems research that has enabled the development and exploration of large models, these capabilities remain confined to a small group of advanced users and industry leaders, creating an implicit technical barrier that keeps the wider community from accessing and leveraging these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components, including the Tensor implementation, the dispatcher system, and the CUDA caching allocator, to provide a non-intrusive user experience and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. Experimental results demonstrate that FSDP achieves performance comparable to Distributed Data Parallel while supporting significantly larger models with near-linear scalability in terms of TFLOPS.
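To illustrate the non-intrusive user experience described above, the following is a minimal sketch of how a model is typically wrapped with FSDP in PyTorch. It is not taken from the paper; the toy model, learning rate, and bf16 mixed-precision setting are illustrative assumptions, and one process per GPU (e.g. launched via torchrun) is assumed.

    # Minimal FSDP sketch: shard parameters, gradients, and optimizer state
    # across ranks while keeping an ordinary data-parallel training loop.
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import MixedPrecision

    def main():
        dist.init_process_group("nccl")  # expects torchrun-style env variables
        rank = dist.get_rank()
        torch.cuda.set_device(rank % torch.cuda.device_count())

        # Placeholder model; any nn.Module can be wrapped.
        model = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096),
            torch.nn.ReLU(),
            torch.nn.Linear(4096, 1024),
        ).cuda()

        # Wrapping with FSDP shards the model across ranks; the optional
        # MixedPrecision config runs compute/communication in bf16.
        fsdp_model = FSDP(
            model,
            mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
        )
        optim = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

        for _ in range(10):
            x = torch.randn(8, 1024, device="cuda")
            loss = fsdp_model(x).sum()
            loss.backward()
            optim.step()
            optim.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Because sharding happens inside the wrapper, the surrounding training loop is unchanged from ordinary data-parallel code, which is the non-intrusive property the abstract emphasizes.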
