Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model-parallelism configurations; neither approach suffices to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelism at two hierarchical levels: inter-operator and intra-operator parallelism. Based on this view, Alpa constructs a new hierarchical space of massive model-parallel execution plans. It uses a set of compilation passes to automatically derive an efficient parallel execution plan at each parallelism level, and an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows that Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems, even on the models those systems are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and to models without manually designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa.
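
As a concrete illustration of how this two-level compilation is exposed to users, the sketch below wraps an ordinary JAX training step with the `alpa.parallelize` decorator documented by the project, letting Alpa derive the inter- and intra-operator plan automatically. The toy model, synthetic batch, and the `alpa.init(cluster="ray")` call (which presumes a running Ray cluster) are simplifying assumptions for illustration, not the paper's evaluation setup.

```python
# A minimal sketch of driving Alpa from JAX, assuming the decorator-based API
# from the project documentation; the toy model and data are illustrative only.
import alpa
import jax
import jax.numpy as jnp

alpa.init(cluster="ray")  # assumption: a Ray cluster has already been started


@alpa.parallelize
def train_step(params, batch):
    """One SGD step; Alpa compiles this function into a two-level parallel plan."""

    def loss_fn(p):
        pred = jnp.dot(batch["x"], p["w"]) + p["b"]
        return jnp.mean((pred - batch["y"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    # Plain SGD update applied leaf-wise over the parameter pytree.
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)


# Toy parameters and one synthetic batch.
params = {"w": jnp.ones((128, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((64, 128)), "y": jnp.zeros((64, 1))}
params = train_step(params, batch)
```

The key point of the sketch is that the training loop is written as if for a single device; the decorator is where Alpa's compilation passes and distributed runtime take over.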
