Decentralized Stochastic Gradient Tracking for Non-convex Empirical Risk Minimization

This paper studies a decentralized stochastic gradient tracking (DSGT) algorithm for non-convex empirical risk minimization over a peer-to-peer network of nodes, in contrast to existing analyses of DSGT, which cover only convex problems. To ensure exact convergence and to handle the variance among decentralized datasets, each node performs a stochastic gradient (SG) tracking step using a mini-batch of samples, with a batch size proportional to the size of its local dataset. We explicitly characterize the convergence rate of DSGT with respect to the number of iterations in terms of the algebraic connectivity of the network, the mini-batch size, the gradient variance, and other problem parameters. Under certain conditions, we further show that DSGT enjoys a network independence property, in the sense that the network topology affects the convergence rate only up to a constant factor; hence the convergence rate of DSGT can be comparable to that of centralized SGD. Moreover, a linear speedup of DSGT with respect to the number of nodes is achievable in some scenarios. Numerical experiments with neural networks and logistic regression on CIFAR-10 illustrate the advantages of DSGT.

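To make the gradient-tracking update concrete, below is a minimal sketch of a standard DSGT-style iteration in Python/NumPy: each node mixes its iterate and its gradient tracker with its neighbors, and the tracker is updated with the increment of a local mini-batch stochastic gradient so that it follows the global gradient. The ring topology, mixing matrix W, least-squares local losses, step size, and fixed batch size are illustrative assumptions, not the paper's exact setup; in particular, the paper ties each node's batch size to the size of its local dataset, which this toy example does not model.

```python
import numpy as np

# Minimal sketch of decentralized stochastic gradient tracking (DSGT).
# Assumptions (illustrative, not from the paper): 4 nodes on a ring, a
# doubly-stochastic mixing matrix W, local least-squares losses, and a
# fixed step size and batch size.

rng = np.random.default_rng(0)
n_nodes, dim, local_n = 4, 5, 100

# Synthetic local datasets; node i holds (A[i], b[i]).
A = rng.normal(size=(n_nodes, local_n, dim))
x_true = rng.normal(size=dim)
b = A @ x_true + 0.1 * rng.normal(size=(n_nodes, local_n))

# Doubly-stochastic mixing matrix for a 4-node ring.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

def minibatch_grad(i, x, batch_size=10):
    """Mini-batch stochastic gradient of node i's least-squares loss at x."""
    idx = rng.choice(local_n, size=batch_size, replace=False)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ x - bi) / batch_size

alpha = 0.01                                   # step size
x = np.zeros((n_nodes, dim))                   # local iterates x_i
g = np.stack([minibatch_grad(i, x[i]) for i in range(n_nodes)])
y = g.copy()                                   # gradient trackers, y_i(0) = g_i(0)

for k in range(500):
    # Consensus step on the descent direction, then local move.
    x = W @ (x - alpha * y)
    # Tracker update: mix neighbors' trackers and add the gradient increment,
    # so each y_i tracks the average of the nodes' stochastic gradients.
    g_new = np.stack([minibatch_grad(i, x[i]) for i in range(n_nodes)])
    y = W @ y + g_new - g
    g = g_new

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to x_true:", np.linalg.norm(x.mean(axis=0) - x_true))
```

In this sketch the tracking variable y_i is what allows exact convergence despite heterogeneous local datasets: it corrects each node's descent direction toward the network-wide average gradient rather than the purely local one.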