1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB’s Convergence Speed

To train large models (like BERT and GPT-3) with hundreds or even thousands of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP interconnects. On one hand, large-batch optimization techniques such as the LAMB algorithm have been proposed to reduce the number of communication rounds. On the other hand, communication compression algorithms such as 1-bit SGD and 1-bit Adam help reduce the volume of each communication. However, we find that simply using one of these techniques is not sufficient to solve the communication challenge, especially on low-bandwidth Ethernet networks. Motivated by this, we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates even when communication is compressed. In addition, we introduce a new system implementation for compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance compared to the existing MPI-based implementation. For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with the NCCL-based backend achieves up to 4.6× communication volume reduction, up to 2.8× end-to-end speedup (in terms of number of training samples per second), and the same convergence speed (in terms of number of pre-training samples needed to reach the same accuracy on fine-tuning tasks) compared to uncompressed LAMB.
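The two ingredients being combined can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the authors' actual 1-bit LAMB algorithm: it pairs error-feedback 1-bit (sign) compression in the style of 1-bit SGD/Adam with LAMB's layerwise trust ratio applied to the decompressed, averaged update. The helper names (`compress_with_error_feedback`, `lamb_layerwise_update`) and the simulated two-worker averaging are illustrative assumptions; indeed, the abstract's point is that this naive combination does not directly preserve LAMB's adaptive layerwise learning rates under compression, which is the gap 1-bit LAMB addresses.

```python
# Illustrative sketch only (assumed helper names; not the authors' exact
# 1-bit LAMB algorithm): error-feedback 1-bit compression in the style of
# 1-bit SGD/Adam, naively combined with LAMB's layerwise trust ratio.
import torch

def compress_with_error_feedback(grad, error):
    """1-bit (sign) compression with a per-tensor scale; the residual is
    kept locally and added back to the next step's gradient."""
    corrected = grad + error
    scale = corrected.abs().mean()               # per-tensor magnitude
    compressed = scale * torch.sign(corrected)   # what would be communicated
    new_error = corrected - compressed           # compression residual
    return compressed, new_error

def lamb_layerwise_update(param, update, lr=1e-3):
    """LAMB-style layerwise scaling: multiply the learning rate by the
    trust ratio ||w|| / ||update|| for this layer's parameter tensor."""
    w_norm, u_norm = param.norm(), update.norm()
    trust_ratio = (w_norm / u_norm).item() if w_norm > 0 and u_norm > 0 else 1.0
    param.data.add_(update, alpha=-lr * trust_ratio)

# Toy usage with two simulated workers; torch.stack(...).mean(0) stands in
# for a compressed all-reduce over the real communication backend.
torch.manual_seed(0)
param = torch.randn(4, 4)
errors = [torch.zeros_like(param) for _ in range(2)]
grads = [torch.randn_like(param) for _ in range(2)]
messages = []
for i, g in enumerate(grads):
    msg, errors[i] = compress_with_error_feedback(g, errors[i])
    messages.append(msg)
avg_update = torch.stack(messages).mean(dim=0)
lamb_layerwise_update(param, avg_update)
```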
