PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

The size of Transformer models is growing at an unprecedented rate: less than a year after the release of GPT-3 (175B parameters), models have reached the trillion-parameter scale. Training such models requires both substantial engineering effort and enormous computing resources, luxuries that most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. PipeTransformer combines an adaptive on-the-fly freeze algorithm, which gradually identifies and freezes some layers during training, with an elastic pipelining system that dynamically reallocates resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs the active layers onto fewer GPUs, and forks more pipeline replicas to widen data parallelism. We evaluate PipeTransformer with Vision Transformer (ViT) on ImageNet and BERT on the SQuAD and GLUE datasets. Our results show that, compared to the state-of-the-art baseline, PipeTransformer attains up to a 2.83-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have modularized our training system with flexible APIs and made the source code publicly available.
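The scheduling idea behind the abstract can be sketched as a simple loop: after each epoch, a freeze decision shrinks the set of active layers, the pipeline is repacked onto fewer GPUs, and the freed GPUs host additional data-parallel replicas. The sketch below is a minimal, simplified illustration of that loop, not the authors' implementation; the names `freeze_decider`, `repack`, and all numeric defaults are hypothetical, and a real system would use distributed pipeline and data-parallel primitives rather than this pure-Python simulation.

```python
# Minimal sketch of an elastic freeze-and-repack scheduling loop
# (simplified; names and constants are hypothetical).

from dataclasses import dataclass


@dataclass
class ClusterState:
    num_gpus: int        # total GPUs available
    pipeline_width: int  # GPUs used by one pipeline replica
    num_replicas: int    # data-parallel pipeline replicas


def freeze_decider(epoch: int, num_layers: int, alpha: float = 0.1) -> int:
    """Hypothetical stand-in for the adaptive on-the-fly freeze algorithm:
    freeze a progressively larger prefix of layers as training proceeds."""
    return min(num_layers - 1, int(alpha * epoch * num_layers))


def repack(num_gpus: int, num_layers: int, frozen: int,
           layers_per_gpu: int) -> ClusterState:
    """Pack the remaining active layers onto as few GPUs as possible,
    then fork additional data-parallel replicas onto the freed GPUs."""
    active = num_layers - frozen
    pipeline_width = max(1, -(-active // layers_per_gpu))  # ceil division
    num_replicas = max(1, num_gpus // pipeline_width)
    return ClusterState(num_gpus, pipeline_width, num_replicas)


def train(num_epochs: int = 10, num_gpus: int = 8,
          num_layers: int = 24, layers_per_gpu: int = 3) -> None:
    for epoch in range(num_epochs):
        frozen = freeze_decider(epoch, num_layers)
        state = repack(num_gpus, num_layers, frozen, layers_per_gpu)
        # In the real system, frozen layers are excluded from the pipeline,
        # active layers migrate to `state.pipeline_width` GPUs, and
        # `state.num_replicas` pipelines train the active layers in parallel.
        print(f"epoch {epoch:2d}: frozen={frozen:2d} "
              f"pipeline_width={state.pipeline_width} "
              f"replicas={state.num_replicas}")


if __name__ == "__main__":
    train()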
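```

Running this sketch shows the intended trend: as more layers freeze, the pipeline width shrinks and the data-parallel width grows, which is the source of the reported speedup.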
