Communication Scheduling for Gossip SGD in a Wide Area Network

Deep neural networks (DNNs) achieve higher accuracy as the amount of training data increases. However, training data such as personal medical records are often privacy sensitive and cannot be collected at a central site. Methods have therefore been proposed for training on distributed data that remains within a wide area network. Because a wide area network is heterogeneous, methods based on synchronous communication, such as all-reduce stochastic gradient descent (SGD), are not suitable; gossip SGD is promising because it communicates asynchronously. Communication time, however, remains a problem in a wide area network, and gossip SGD cannot use double buffering, a technique for hiding communication time, because its communication is asynchronous. In this paper, we propose a variant of gossip SGD that overlaps computation and communication to accelerate training. The proposed method shares newer models by scheduling communication: to schedule exchanges, nodes share their estimated communication times and information on which nodes are currently able to communicate. The method is effective in both homogeneous and heterogeneous networks. Experimental results on the CIFAR-100 and Fashion-MNIST datasets demonstrate the faster convergence of the proposed method.
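To make the overlap-and-scheduling idea concrete, the following is a minimal sketch in Python (using numpy and the standard threading module). It is an illustration under simplifying assumptions, not the authors' implementation; all names (Node, pick_peer, estimated_rtt, busy, local_sgd_step) are hypothetical. Each node launches a gossip exchange in a background thread, continues its local SGD steps while the exchange is in flight, and chooses its next partner from the peers it believes are idle, preferring the one with the shortest estimated communication time.

```python
import threading

import numpy as np


class Node:
    """One node in a simulated gossip-SGD run (hypothetical sketch)."""

    def __init__(self, node_id, model, estimated_rtt):
        self.node_id = node_id
        self.model = model                  # flat parameter vector (np.ndarray)
        self.peers = {}                     # peer_id -> Node, filled in by the caller
        self.estimated_rtt = estimated_rtt  # peer_id -> estimated communication time (s)
        self.busy = False                   # advertised "currently communicating" flag
        self.lock = threading.Lock()

    def pick_peer(self):
        # Scheduling step: among peers advertised as idle, prefer the one
        # with the shortest estimated communication time.
        candidates = [(rtt, pid) for pid, rtt in self.estimated_rtt.items()
                      if not self.peers[pid].busy]
        return self.peers[min(candidates)[1]] if candidates else None

    def exchange(self, peer):
        # Gossip step: average the two models in place. Locks are taken in a
        # fixed order (by node_id) to avoid deadlock between concurrent exchanges.
        first, second = (self, peer) if self.node_id < peer.node_id else (peer, self)
        with first.lock, second.lock:
            avg = (self.model + peer.model) / 2.0
            self.model[:] = avg
            peer.model[:] = avg

    def train(self, steps, local_sgd_step):
        comm_thread, comm_peer = None, None
        for _ in range(steps):
            # Launch the next gossip exchange in the background so that the
            # local SGD step below overlaps with the communication.
            if comm_thread is None:
                comm_peer = self.pick_peer()
                if comm_peer is not None:
                    self.busy = comm_peer.busy = True
                    comm_thread = threading.Thread(target=self.exchange,
                                                   args=(comm_peer,))
                    comm_thread.start()

            local_sgd_step(self.model)  # computation proceeds meanwhile

            # Harvest a finished exchange without blocking the training loop.
            if comm_thread is not None and not comm_thread.is_alive():
                comm_thread.join()
                self.busy = comm_peer.busy = False
                comm_thread, comm_peer = None, None


# Toy usage: two nodes with a 10-parameter model and a dummy local SGD step.
# Races between a concurrent exchange and the local update are ignored here
# for brevity; a real system would synchronize the parameter buffer.
a = Node(0, np.zeros(10), {1: 0.05})
b = Node(1, np.ones(10), {0: 0.05})
a.peers[1], b.peers[0] = b, a

def dummy_sgd_step(params):
    params -= 0.1 * params      # stands in for a real gradient update

t = threading.Thread(target=a.train, args=(50, dummy_sgd_step))
t.start()
b.train(50, dummy_sgd_step)
t.join()
```

The sketch keeps the busy flags and estimated communication times as plain shared attributes; in a wide area deployment these would instead be exchanged over the network as part of the scheduling protocol described in the paper.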
