PartIR: Composing SPMD Partitioning Strategies for Machine Learning

Training modern large neural networks (NNs) requires a combination of parallelization strategies encompassing data, model, and optimizer sharding. As these strategies grow in complexity, partitioning tools must be 1) expressive, allowing simpler strategies to be composed, and 2) predictable, so that performance can be estimated analytically. We present PartIR, our design for an NN partitioning system. PartIR takes an incremental approach to rewriting and is hardware- and runtime-agnostic. We present a simple yet powerful API for composing sharding strategies and a simulator to validate them. The process is driven by high-level, programmer-issued partitioning tactics, which can be manual or automatic. Importantly, the tactics are specified separately from the model code, making them easy to change. We evaluate PartIR on several different models to demonstrate its predictability, expressibility, and ability to reach peak performance.
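PartIR's tactic API is not shown in this abstract, so the sketch below only illustrates the general idea it describes, keeping sharding decisions separate from the model code, using standard JAX sharding primitives (Mesh, NamedSharding, PartitionSpec) rather than PartIR itself. The axis names, shapes, and device count are illustrative assumptions, not PartIR's interface.

# Hypothetical sketch: composing a data-parallel ("batch") and a
# model-parallel ("model") sharding for a simple matmul layer, with the
# partitioning choices kept outside the model code. Plain JAX, not PartIR.
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from jax.experimental import mesh_utils

# Model code: knows nothing about how it will be partitioned.
def layer(params, x):
    return jnp.dot(x, params["w"]) + params["b"]

# Partitioning decisions, specified separately from the model:
# a 2D device mesh with a data axis and a model axis (assumes 8 devices).
devices = mesh_utils.create_device_mesh((2, 4))
mesh = Mesh(devices, axis_names=("batch", "model"))

x_sharding = NamedSharding(mesh, P("batch", None))      # shard inputs over batch
w_sharding = NamedSharding(mesh, P(None, "model"))      # shard weights over model
b_sharding = NamedSharding(mesh, P("model"))
out_sharding = NamedSharding(mesh, P("batch", "model"))

sharded_layer = jax.jit(
    layer,
    in_shardings=({"w": w_sharding, "b": b_sharding}, x_sharding),
    out_shardings=out_sharding,
)

params = {"w": jnp.ones((512, 1024)), "b": jnp.zeros((1024,))}
x = jnp.ones((32, 512))
y = sharded_layer(params, x)  # the SPMD compiler inserts the collectives

In PartIR, per the abstract, the analogous decisions are expressed as incremental rewrite tactics applied to the program rather than as jit sharding annotations, and can be issued manually or discovered automatically.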
