Communication trade-offs for synchronized distributed SGD with large step size

Synchronous mini-batch SGD is the state of the art for large-scale distributed machine learning. In practice, however, its convergence is bottlenecked by slow communication rounds between worker nodes. A natural way to reduce communication is the \emph{local-SGD} model, in which the workers train their models independently and synchronize only once in a while. This algorithm improves the computation-communication trade-off, but its convergence is not well understood. We propose a non-asymptotic error analysis that enables comparison to \emph{one-shot averaging}, i.e., a single communication round among independent workers, and \emph{mini-batch averaging}, i.e., communication at every step. We also provide adaptive lower bounds on the communication frequency for large step sizes ($t^{-\alpha}$, $\alpha \in (1/2, 1)$) and show that local-SGD reduces communication by a factor of $O\Big(\frac{\sqrt{T}}{P^{3/2}}\Big)$, with $T$ the total number of gradients and $P$ the number of machines.
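The interpolation between the two extremes can be made concrete with a short sketch. The snippet below is a minimal, hypothetical Python implementation of local-SGD (the names local_sgd, grad, and lsq_grad are ours, not from the paper, and here T counts steps per worker rather than total gradients): P workers run SGD with step size $t^{-\alpha}$ independently and average their iterates every H steps, so H = 1 recovers mini-batch averaging and a single averaging at the end recovers one-shot averaging.

# Minimal sketch of local-SGD with periodic averaging (hypothetical helper
# names; not the paper's code). P workers run SGD independently with step
# size t^{-alpha} and average their iterates every H steps. H = 1 recovers
# mini-batch averaging; averaging only once at the end is one-shot averaging.
import numpy as np

def local_sgd(grad, w0, T, P, H, alpha=0.75, seed=0):
    """grad(w, rng) must return one stochastic gradient at w."""
    rng = np.random.default_rng(seed)
    workers = [w0.copy() for _ in range(P)]
    for t in range(1, T + 1):
        step = t ** (-alpha)              # large step size, alpha in (1/2, 1)
        for p in range(P):
            workers[p] = workers[p] - step * grad(workers[p], rng)
        if t % H == 0:                    # communication round: average models
            avg = sum(workers) / P
            workers = [avg.copy() for _ in range(P)]
    return sum(workers) / P               # final averaged iterate

# Toy usage on a noisy least-squares objective, f(w) = E[(x^T w - y)^2] / 2.
d = 10
w_star = np.ones(d)

def lsq_grad(w, rng):
    x = rng.normal(size=d)
    y = x @ w_star + 0.1 * rng.normal()
    return (x @ w - y) * x

w_hat = local_sgd(lsq_grad, np.zeros(d), T=10_000, P=4, H=100)

Drawing all workers' samples from one random generator serializes the simulation; in a real distributed run each worker would draw its own mini-batches and the averaging step would be an all-reduce, which is exactly the communication round whose frequency the analysis bounds.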
