LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning
Tianyi Chen | Georgios B. Giannakis | Tao Sun | Wotao Yin
[1] Kevin Baker, et al. Classification of radar returns from the ionosphere using neural networks, 1989.
[2] Yurii Nesterov, et al. Introductory Lectures on Convex Optimization - A Basic Course, 2004, Applied Optimization.
[3] Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.
[4] Le Song, et al. Supervised feature selection via dependence estimation, 2007, ICML '07.
[5] Asuman E. Ozdaglar, et al. Distributed Subgradient Methods for Multi-Agent Optimization, 2009, IEEE Transactions on Automatic Control.
[6] Ming Yan, et al. ARock: an Algorithmic Framework for Asynchronous Parallel Coordinate Updates, 2015, SIAM J. Sci. Comput.
[7] Blaise Agüera y Arcas, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data, 2016, AISTATS.
[8] Mark W. Schmidt, et al. Minimizing finite sums with the stochastic average gradient, 2013, Mathematical Programming.
[9] Qing Ling, et al. Decentralized learning for wireless communications and networking, 2015, arXiv.
[10] Ohad Shamir, et al. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method, 2013, ICML.
[11] Ameet Talwalkar, et al. Federated Multi-Task Learning, 2017, NIPS.
[12] Damek Davis, et al. Convergence Rate Analysis of Several Splitting Schemes, 2014, arXiv:1406.4834.
[13] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[14] Alexander J. Smola, et al. Communication Efficient Distributed Machine Learning with the Parameter Server, 2014, NIPS.
[15] Francis Bach, et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives, 2014, NIPS.
[16] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.
[17] Yi Zhou, et al. Communication-efficient algorithms for decentralized and stochastic optimization, 2017, Mathematical Programming.
[18] H. Robbins, et al. A Stochastic Approximation Method, 1951, The Annals of Mathematical Statistics.
[19] Alejandro Ribeiro, et al. Consensus in Ad Hoc WSNs With Noisy Links—Part I: Distributed Estimation of Deterministic Signals, 2008, IEEE Transactions on Signal Processing.
[20] Thomas Hofmann, et al. Communication-Efficient Distributed Dual Coordinate Ascent, 2014, NIPS.
[21] Alfred O. Hero, et al. A Convergent Incremental Gradient Method with a Constant Step Size, 2007, SIAM J. Optim.
[22] Asuman E. Ozdaglar, et al. On the Convergence Rate of Incremental Aggregated Gradient Algorithms, 2015, SIAM J. Optim.
[23] Yuchen Zhang, et al. DiSCO: Distributed Optimization for Self-Concordant Empirical Loss, 2015, ICML.
[24] Yun Yang, et al. Communication-Efficient Distributed Statistical Inference, 2016, Journal of the American Statistical Association.
[25] H. Altay Güvenir, et al. Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals, 1998, Artif. Intell. Medicine.
[26] Randy H. Katz, et al. A Berkeley View of Systems Challenges for AI, 2017, arXiv.
[27] Qing Ling, et al. Asynchronous periodic event-triggered coordination of multi-agent systems, 2017, 2017 IEEE 56th Annual Conference on Decision and Control (CDC).
[28] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.
[29] Ron Kohavi, et al. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid, 1996, KDD.
[30] Yijun Huang, et al. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization, 2015, NIPS.
[31] D. Rubinfeld, et al. Hedonic housing prices and the demand for clean air, 1978, Journal of Environmental Economics and Management.
[32] Léon Bottou, et al. Large-Scale Machine Learning with Stochastic Gradient Descent, 2010, COMPSTAT.
[33] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.
[34] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Rev.
[35] Ananda Theertha Suresh, et al. Distributed Mean Estimation with Limited Communication, 2016, ICML.
[36] Mark W. Schmidt, et al. Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition, 2016, ECML/PKDD.
[37] Martin J. Wainwright, et al. Communication-efficient algorithms for statistical optimization, 2012, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).
[38] Francisco Facchinei, et al. Asynchronous Parallel Algorithms for Nonconvex Big-Data Optimization: Model and Convergence, 2016, arXiv.
[39] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[40] Kenneth Heafield, et al. Sparse Communication for Distributed Gradient Descent, 2017, EMNLP.
[41] Xin Wang, et al. Learning and Management for Internet of Things: Accounting for Adaptivity and Scalability, 2018, Proceedings of the IEEE.
[42] Michael I. Jordan, et al. Distributed optimization with arbitrary local solvers, 2015, Optim. Methods Softw.
[43] Wotao Yin, et al. Asynchronous Coordinate Descent under More Realistic Assumptions, 2017, NIPS.
[44] Stephen J. Wright, et al. An asynchronous parallel stochastic coordinate descent algorithm, 2013, J. Mach. Learn. Res.