On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization

We adopt and analyze a synchronous K-step averaging stochastic gradient descent algorithm, which we call K-AVG, for solving large-scale machine learning problems. We establish convergence results for K-AVG with nonconvex objectives, and our analysis applies to many existing variants of synchronous SGD. We explain why the K-step delay is necessary and leads to better performance than traditional parallel stochastic gradient descent, which is equivalent to K-AVG with $K=1$. We also show that K-AVG scales better with the number of learners than asynchronous stochastic gradient descent (ASGD). Another advantage of K-AVG over ASGD is that it tolerates larger stepsizes, which facilitates faster convergence. On a cluster of $128$ GPUs, K-AVG is faster than ASGD implementations and achieves better accuracy and faster convergence when training on the CIFAR-10 dataset.
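The abstract describes K-AVG only verbally: each of P learners runs K local SGD steps, then the learner parameters are averaged synchronously. Below is a minimal sketch of that update loop, assuming a generic stochastic-gradient oracle grad_fn and a sequential simulation of the learners; the function and parameter names (k_avg_sgd, grad_fn, num_learners, num_rounds) are illustrative and not the paper's notation or implementation.

    import numpy as np

    def k_avg_sgd(grad_fn, x0, num_learners, K, stepsize, num_rounds, seed=0):
        """Sketch of K-step averaging SGD (K-AVG).

        grad_fn(x, rng) should return a stochastic gradient at x.
        """
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        for _ in range(num_rounds):
            local_models = []
            for _ in range(num_learners):      # in practice: one replica per GPU
                x_p = x.copy()
                for _ in range(K):             # K local SGD steps before communicating
                    x_p -= stepsize * grad_fn(x_p, rng)
                local_models.append(x_p)
            x = np.mean(local_models, axis=0)  # synchronous averaging across learners
        return x

    # Illustrative usage with noisy gradients of f(x) = 0.5 * ||x||^2:
    # x_star = k_avg_sgd(lambda x, rng: x + 0.1 * rng.standard_normal(x.shape),
    #                    x0=np.ones(10), num_learners=8, K=4,
    #                    stepsize=0.1, num_rounds=50)

Setting K=1 in this sketch recovers traditional synchronous parallel SGD with parameter averaging, the baseline to which the paper compares K-AVG.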
