Staleness-Aware Async-SGD for Distributed Deep Learning

Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks, and asynchronous SGD (ASGD) has been widely adopted for accelerating the training of large-scale deep networks in a distributed computing environment. In practice, however, it is quite challenging to tune the training hyperparameters (such as the learning rate) when using ASGD so as to achieve convergence and linear speedup, since the stability of the optimization algorithm is strongly influenced by the asynchronous nature of parameter updates. In this paper, we propose a variant of the ASGD algorithm in which the learning rate is modulated according to the gradient staleness, and we provide theoretical guarantees for the convergence of this algorithm. Experimental verification is performed on commonly used image classification benchmarks, CIFAR-10 and ImageNet, to demonstrate the superior effectiveness of the proposed approach compared to SSGD (Synchronous SGD) and the conventional ASGD algorithm.
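To make the staleness-dependent learning-rate modulation concrete, below is a minimal NumPy sketch of a single parameter-server update. The function name, the lr = base_lr / staleness rule, and all variable names are illustrative assumptions for this sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def staleness_aware_update(params, grad, base_lr, current_step, grad_step):
    """Apply one ASGD update, scaling the learning rate by gradient staleness.

    Staleness tau = (current parameter-server step) - (step at which the
    worker pulled the parameters used to compute `grad`). The modulation
    rule used here, lr = base_lr / tau, is one natural choice and is an
    assumption made for illustration.
    """
    staleness = max(1, current_step - grad_step)  # tau >= 1
    effective_lr = base_lr / staleness            # damp stale gradients
    return params - effective_lr * grad

# Usage: a worker computed `grad` against parameters pulled at server step 40,
# but the parameter server has since advanced to step 44 (staleness = 4),
# so the gradient is applied with a quarter of the base learning rate.
params = np.zeros(10)
grad = np.random.randn(10)
params = staleness_aware_update(params, grad, base_lr=0.1,
                                current_step=44, grad_step=40)
```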
