2D-HRA: Two-Dimensional Hierarchical Ring-Based All-Reduce Algorithm in Large-Scale Distributed Machine Learning

Gradient synchronization, the communication process by which machines exchange gradients in large-scale distributed machine learning (DML), plays a crucial role in DML performance. As distributed clusters continue to grow, state-of-the-art DML synchronization algorithms suffer from high latency when scaled to thousands of GPUs. In this article, we propose 2D-HRA, a two-dimensional hierarchical ring-based all-reduce algorithm for large-scale DML. 2D-HRA combines the ring algorithm with more latency-optimal hierarchical methods and synchronizes parameters along two dimensions to make full use of the available bandwidth. Simulation results show that 2D-HRA efficiently alleviates the high latency and accelerates the synchronization process in large-scale clusters. Compared with the traditional ring-based algorithm, 2D-HRA reduces gradient synchronization time by up to 76.9% in clusters of different scales.
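To make the two-dimensional idea concrete, the following is a minimal sketch of a generic two-dimensional hierarchical all-reduce, assuming workers are arranged in a rows x cols grid and that a ring all-reduce along each dimension leaves every participant with the sum of that dimension. The function names (ring_allreduce, hierarchical_2d_allreduce) and the grid layout are illustrative assumptions, not the authors' implementation.

# Illustrative sketch (not the authors' code): workers sit in a rows x cols grid,
# reduce along their row first (dimension 1), then along their column (dimension 2).

import numpy as np

def ring_allreduce(vectors):
    # Result a ring all-reduce converges to: every participant ends up
    # holding the element-wise sum of all inputs.
    total = np.sum(vectors, axis=0)
    return [total.copy() for _ in vectors]

def hierarchical_2d_allreduce(grads, rows, cols):
    # grads: one gradient vector per worker, with worker id = r * cols + c.
    grads = [np.asarray(g, dtype=np.float64).copy() for g in grads]

    # Dimension 1: all-reduce within each row (e.g. GPUs sharing a node or rack),
    # so every worker in row r holds the partial sum of that row.
    for r in range(rows):
        ids = [r * cols + c for c in range(cols)]
        for i, g in zip(ids, ring_allreduce([grads[i] for i in ids])):
            grads[i] = g

    # Dimension 2: all-reduce within each column, summing the row partial sums,
    # so every worker ends up with the global sum.
    for c in range(cols):
        ids = [r * cols + c for r in range(rows)]
        for i, g in zip(ids, ring_allreduce([grads[i] for i in ids])):
            grads[i] = g

    return grads

# Quick check: every worker should hold the global sum.
rows, cols, dim = 2, 4, 8
rng = np.random.default_rng(0)
grads_in = [rng.standard_normal(dim) for _ in range(rows * cols)]
grads_out = hierarchical_2d_allreduce(grads_in, rows, cols)
assert all(np.allclose(g, np.sum(grads_in, axis=0)) for g in grads_out)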
