Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model-parallelism configurations; neither approach suffices to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelism at two hierarchical levels: inter-operator and intra-operator parallelism. Based on this view, Alpa constructs a new hierarchical space of massive model-parallel execution plans. It uses a set of compilation passes to automatically derive an efficient parallel execution plan at each parallelism level, and an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows that Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems, even on the models those systems are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and to models without manually designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa.
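
As a concrete illustration of how this two-level compilation is exposed to users, the sketch below wraps an ordinary JAX training step with the `alpa.parallelize` decorator documented by the project, letting Alpa derive the inter- and intra-operator plan automatically. The toy model, synthetic batch, and the `alpa.init(cluster="ray")` call (which presumes a running Ray cluster) are simplifying assumptions for illustration, not the paper's evaluation setup.

```python
# A minimal sketch of driving Alpa from JAX, assuming the decorator-based API
# from the project documentation; the toy model and data are illustrative only.
import alpa
import jax
import jax.numpy as jnp

alpa.init(cluster="ray")  # assumption: a Ray cluster has already been started


@alpa.parallelize
def train_step(params, batch):
    """One SGD step; Alpa compiles this function into a two-level parallel plan."""

    def loss_fn(p):
        pred = jnp.dot(batch["x"], p["w"]) + p["b"]
        return jnp.mean((pred - batch["y"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    # Plain SGD update applied leaf-wise over the parameter pytree.
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)


# Toy parameters and one synthetic batch.
params = {"w": jnp.ones((128, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((64, 128)), "y": jnp.zeros((64, 1))}
params = train_step(params, batch)
```

The key point of the sketch is that the training loop is written as if for a single device; the decorator is where Alpa's compilation passes and distributed runtime take over.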
