Communication trade-offs for synchronized distributed SGD with large step size

Synchronous mini-batch SGD is the state of the art for large-scale distributed machine learning. In practice, however, its convergence is bottlenecked by slow communication rounds between worker nodes. A natural way to reduce communication is the \emph{local-SGD} model, in which the workers train their models independently and synchronize only once in a while. This algorithm improves the computation-communication trade-off, but its convergence is not well understood. We propose a non-asymptotic error analysis that enables comparison to \emph{one-shot averaging}, i.e., a single communication round among independent workers, and \emph{mini-batch averaging}, i.e., communication at every step. We also provide adaptive lower bounds on the communication frequency for large step sizes ($t^{-\alpha}$, $\alpha \in (1/2, 1)$) and show that local-SGD reduces communication by a factor of $O\Big(\frac{\sqrt{T}}{P^{3/2}}\Big)$, with $T$ the total number of gradients and $P$ the number of machines.
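The interpolation between the two extremes can be made concrete with a short sketch. The snippet below is a minimal, hypothetical Python implementation of local-SGD (the names local_sgd, grad, and lsq_grad are ours, not from the paper, and here T counts steps per worker rather than total gradients): P workers run SGD with step size $t^{-\alpha}$ independently and average their iterates every H steps, so H = 1 recovers mini-batch averaging and a single averaging at the end recovers one-shot averaging.

# Minimal sketch of local-SGD with periodic averaging (hypothetical helper
# names; not the paper's code). P workers run SGD independently with step
# size t^{-alpha} and average their iterates every H steps. H = 1 recovers
# mini-batch averaging; averaging only once at the end is one-shot averaging.
import numpy as np

def local_sgd(grad, w0, T, P, H, alpha=0.75, seed=0):
    """grad(w, rng) must return one stochastic gradient at w."""
    rng = np.random.default_rng(seed)
    workers = [w0.copy() for _ in range(P)]
    for t in range(1, T + 1):
        step = t ** (-alpha)              # large step size, alpha in (1/2, 1)
        for p in range(P):
            workers[p] = workers[p] - step * grad(workers[p], rng)
        if t % H == 0:                    # communication round: average models
            avg = sum(workers) / P
            workers = [avg.copy() for _ in range(P)]
    return sum(workers) / P               # final averaged iterate

# Toy usage on a noisy least-squares objective, f(w) = E[(x^T w - y)^2] / 2.
d = 10
w_star = np.ones(d)

def lsq_grad(w, rng):
    x = rng.normal(size=d)
    y = x @ w_star + 0.1 * rng.normal()
    return (x @ w - y) * x

w_hat = local_sgd(lsq_grad, np.zeros(d), T=10_000, P=4, H=100)

Drawing all workers' samples from one random generator serializes the simulation; in a real distributed run each worker would draw its own mini-batches and the averaging step would be an all-reduce, which is exactly the communication round whose frequency the analysis bounds.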
