On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization

We adopt and analyze a synchronous K-step averaging stochastic gradient descent algorithm, which we call K-AVG, for solving large-scale machine learning problems. We establish convergence results for K-AVG with nonconvex objectives, and our analysis applies to many existing variants of synchronous SGD. We explain why the K-step delay is necessary and leads to better performance than traditional parallel stochastic gradient descent, which is equivalent to K-AVG with $K=1$. We also show that K-AVG scales better with the number of learners than asynchronous stochastic gradient descent (ASGD). Another advantage of K-AVG over ASGD is that it tolerates larger stepsizes, which facilitates faster convergence. On a cluster of $128$ GPUs, K-AVG is faster than ASGD implementations and achieves better accuracy and faster convergence when training on the CIFAR-10 dataset.
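The abstract describes K-AVG only verbally: each of P learners runs K local SGD steps, then the learner parameters are averaged synchronously. Below is a minimal sketch of that update loop, assuming a generic stochastic-gradient oracle grad_fn and a sequential simulation of the learners; the function and parameter names (k_avg_sgd, grad_fn, num_learners, num_rounds) are illustrative and not the paper's notation or implementation.

    import numpy as np

    def k_avg_sgd(grad_fn, x0, num_learners, K, stepsize, num_rounds, seed=0):
        """Sketch of K-step averaging SGD (K-AVG).

        grad_fn(x, rng) should return a stochastic gradient at x.
        """
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        for _ in range(num_rounds):
            local_models = []
            for _ in range(num_learners):      # in practice: one replica per GPU
                x_p = x.copy()
                for _ in range(K):             # K local SGD steps before communicating
                    x_p -= stepsize * grad_fn(x_p, rng)
                local_models.append(x_p)
            x = np.mean(local_models, axis=0)  # synchronous averaging across learners
        return x

    # Illustrative usage with noisy gradients of f(x) = 0.5 * ||x||^2:
    # x_star = k_avg_sgd(lambda x, rng: x + 0.1 * rng.standard_normal(x.shape),
    #                    x0=np.ones(10), num_learners=8, K=4,
    #                    stepsize=0.1, num_rounds=50)

Setting K=1 in this sketch recovers traditional synchronous parallel SGD with parameter averaging, the baseline to which the paper compares K-AVG.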
