BAGUA: Scaling up Distributed Learning with System Relaxations

Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fall into two paradigms, i.e., parameter servers and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower communication cost via “system relaxations”: quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization, and therefore cannot take advantage of all the optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, a communication framework whose design goal is to provide a system abstraction that is both flexible and modular enough to support state-of-the-art system relaxation techniques for distributed training. Powered by this new system design, BAGUA makes it straightforward to implement and extend a variety of state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), BAGUA outperforms PyTorch-DDP, Horovod, and BytePS in end-to-end training time by a significant margin (up to 1.95×) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.

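To make the “quantization” relaxation mentioned above concrete, the sketch below shows one common instance of it: 1-bit sign compression with error feedback, where each worker communicates a sign tensor plus a scale instead of the full-precision gradient. This is a minimal illustrative sketch in plain PyTorch, not BAGUA's actual implementation or API; the function name `compress_with_error_feedback` is a hypothetical helper introduced only for this example.

```python
# Illustrative sketch of gradient quantization with error feedback
# (a hypothetical helper, not BAGUA's actual implementation).
import torch

def compress_with_error_feedback(grad: torch.Tensor, error: torch.Tensor) -> torch.Tensor:
    """Quantize a gradient to its sign, scaled to preserve average magnitude.

    The residual lost by quantization is accumulated into `error` and added
    back at the next step; this error-feedback mechanism is what keeps such
    relaxed algorithms convergent in practice.
    """
    corrected = grad + error                    # add back the previous residual
    scale = corrected.abs().mean()              # one scalar per tensor
    compressed = scale * torch.sign(corrected)  # 1 bit per element, plus the scale
    error.copy_(corrected - compressed)         # remember what was lost this step
    return compressed

# Toy usage: each worker would all-reduce `compressed` instead of `grad`.
grad = torch.randn(1024)
error = torch.zeros_like(grad)
compressed = compress_with_error_feedback(grad, error)
print(compressed.unique())  # only ±scale values would be communicated
```

In a data-parallel setting, the communication savings come from exchanging the sign tensor (1 bit per element) and a single scale instead of 32-bit gradients; decentralization and communication delay trade accuracy for bandwidth or latency in analogous ways.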