2D-HRA: Two-Dimensional Hierarchical Ring-Based All-Reduce Algorithm in Large-Scale Distributed Machine Learning

Gradient synchronization, the communication process by which machines exchange gradients in large-scale distributed machine learning (DML), plays a crucial role in DML performance. As distributed clusters continue to grow, state-of-the-art DML synchronization algorithms suffer from high latency when scaled to thousands of GPUs. In this article, we propose 2D-HRA, a two-dimensional hierarchical ring-based all-reduce algorithm for large-scale DML. 2D-HRA combines the ring algorithm with more latency-optimal hierarchical methods and synchronizes parameters along two dimensions to make full use of the available bandwidth. Simulation results show that 2D-HRA efficiently alleviates the high latency and accelerates the synchronization process in large-scale clusters. Compared with the traditional ring-based algorithm, 2D-HRA reduces gradient synchronization time by up to 76.9% in clusters of different scales.
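To make the two-dimensional idea concrete, the following is a minimal sketch of a generic two-dimensional hierarchical all-reduce, assuming workers are arranged in a rows x cols grid and that a ring all-reduce along each dimension leaves every participant with the sum of that dimension. The function names (ring_allreduce, hierarchical_2d_allreduce) and the grid layout are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' code): workers sit in a rows x cols grid,
# reduce along their row first (dimension 1), then along their column (dimension 2).

import numpy as np

def ring_allreduce(vectors):
    # Result a ring all-reduce converges to: every participant ends up
    # holding the element-wise sum of all inputs.
    total = np.sum(vectors, axis=0)
    return [total.copy() for _ in vectors]

def hierarchical_2d_allreduce(grads, rows, cols):
    # grads: one gradient vector per worker, with worker id = r * cols + c.
    grads = [np.asarray(g, dtype=np.float64).copy() for g in grads]

    # Dimension 1: all-reduce within each row (e.g. GPUs sharing a node or rack),
    # so every worker in row r holds the partial sum of that row.
    for r in range(rows):
        ids = [r * cols + c for c in range(cols)]
        for i, g in zip(ids, ring_allreduce([grads[i] for i in ids])):
            grads[i] = g

    # Dimension 2: all-reduce within each column, summing the row partial sums,
    # so every worker ends up with the global sum.
    for c in range(cols):
        ids = [r * cols + c for r in range(rows)]
        for i, g in zip(ids, ring_allreduce([grads[i] for i in ids])):
            grads[i] = g

    return grads

# Quick check: every worker should hold the global sum.
rows, cols, dim = 2, 4, 8
rng = np.random.default_rng(0)
grads_in = [rng.standard_normal(dim) for _ in range(rows * cols)]
grads_out = hierarchical_2d_allreduce(grads_in, rows, cols)
assert all(np.allclose(g, np.sum(grads_in, axis=0)) for g in grads_out)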
