Delayed Stochastic Algorithms for Distributed Weakly Convex Optimization

This paper studies delayed stochastic algorithms for weakly convex optimization in a distributed network with workers connected to a master node. More specifically, we consider a structured stochastic weakly convex objective that is the composition of a convex function and a smooth nonconvex function. Recently, Xu et al. (2022) showed that an inertial stochastic subgradient method converges at a rate of $\mathcal{O}(\tau/\sqrt{K})$, which suffers a significant penalty from the maximum information delay $\tau$. To alleviate this issue, we propose a new delayed stochastic prox-linear ($\texttt{DSPL}$) method in which the master performs the proximal update of the parameters and the workers only need to linearly approximate the inner smooth function. Somewhat surprisingly, we show that the delays only affect the high-order term in the complexity bound and are therefore negligible after a certain number of $\texttt{DSPL}$ iterations. Moreover, to further improve empirical performance, we propose a delayed stochastic extrapolated prox-linear ($\texttt{DSEPL}$) method, which employs Polyak-type momentum to speed up convergence. Building on the tools developed for analyzing $\texttt{DSPL}$, we also provide an improved analysis of the delayed stochastic subgradient method ($\texttt{DSGD}$). In particular, for general weakly convex problems, we show that the convergence of $\texttt{DSGD}$ depends only on the expected delay.
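
To make the master-worker split concrete, the following is a minimal single-machine sketch of a $\texttt{DSPL}$-style update, not the paper's implementation. It assumes the robust phase retrieval instance with $h(z)=|z|$ and $c(x;a,b)=\langle a,x\rangle^2-b$, a randomly simulated delay, and illustrative choices for the step size $\gamma$ and the helper name `prox_linear_step`. The worker supplies only the linearization of $c$ at a stale iterate, and the master solves the resulting proximal subproblem, which admits a closed form for this choice of $h$.

```python
# Minimal sketch (not the paper's code) of a delayed stochastic prox-linear (DSPL)-style
# update for robust phase retrieval: minimize E_{(a,b)} |<a, x>^2 - b|, i.e. the convex
# h(z) = |z| composed with the smooth nonconvex map c(x; a, b) = <a, x>^2 - b.
# The delay model, step size, and helper names below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, K, gamma, tau_max = 20, 2000, 0.05, 10      # dimension, iterations, step size, max delay

x_true = rng.normal(size=d)                    # ground-truth signal (defines the samples)
x = rng.normal(size=d)                         # master's current iterate
history = [x.copy()]                           # past iterates, used to simulate stale workers


def prox_linear_step(x_cur, y_stale, a, b, gamma):
    """Master step: argmin_x |c(y) + <grad c(y), x - y>| + ||x - x_cur||^2 / (2 * gamma).

    The worker only reports the linearization (c(y), grad c(y)) taken at the stale
    iterate y; the master keeps the proximal term centered at its current iterate.
    For h = |.| the subproblem has the closed form below.
    """
    c_val = np.dot(a, y_stale) ** 2 - b        # c(y; a, b)
    g = 2.0 * np.dot(a, y_stale) * a           # gradient of c at y
    offset = c_val - np.dot(g, y_stale)        # affine model: l(x) = <g, x> + offset
    r = np.dot(g, x_cur) + offset              # model value at the proximal center
    gg = np.dot(g, g) + 1e-12
    if r > gamma * gg:
        return x_cur - gamma * g
    if r < -gamma * gg:
        return x_cur + gamma * g
    return x_cur - (r / gg) * g                # otherwise land on {x : l(x) = 0}


for k in range(K):
    delay = rng.integers(0, min(tau_max, len(history)))  # simulated information delay
    y_stale = history[-1 - delay]                        # worker linearizes at a stale iterate
    a = rng.normal(size=d)                               # worker draws a fresh sample
    b = np.dot(a, x_true) ** 2
    x = prox_linear_step(x, y_stale, a, b, gamma)
    history.append(x.copy())
    # A DSEPL-style variant would extrapolate the proximal center with Polyak-type
    # momentum, e.g. x + beta * (x - history[-2]), before solving the same subproblem.

print("distance to +/- x_true:", min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true)))
```

The design point the sketch illustrates is that the master's subproblem stays cheap (here a closed form) even though the worker's linearization is stale; a $\texttt{DSGD}$-style baseline would instead apply the delayed subgradient directly, as indicated in the final comment.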

[1] Sebastian U. Stich, et al. Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning, 2022, NeurIPS.

[2] Blake E. Woodworth, et al. Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays, 2022, NeurIPS.

[3] Jie Chen, et al. Distributed Stochastic Inertial-Accelerated Methods with Delayed Derivatives for Nonconvex Problems, 2021, SIAM J. Imaging Sci.

[4] Amit Daniely, et al. Asynchronous Stochastic Optimization Robust to Arbitrary Delays, 2021, NeurIPS.

[5] Qi Deng, et al. Minibatch and Momentum Model-based Methods for Stochastic Weakly Convex Optimization, 2021, NeurIPS.

[6] Damek Davis, et al. Low-Rank Matrix Recovery with Composite Optimization: Good Conditioning and Rapid Convergence, 2021, Foundations of Computational Mathematics.

[7] Shahin Shahrampour, et al. On Distributed Nonconvex Optimization: Projected Subgradient Method for Weakly Convex Problems in Networks, 2020, IEEE Transactions on Automatic Control.

[8] Lin Xiao, et al. Stochastic variance-reduced prox-linear algorithms for nonconvex composite optimization, 2020, Mathematical Programming.

[9] Mikael Johansson, et al. Convergence of a Stochastic Gradient Method with Momentum for Nonsmooth Nonconvex Optimization, 2020, ICML.

[10] M. Papatriantafilou, et al. MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent, 2019, 2019 IEEE International Conference on Big Data (Big Data).

[11] Anthony Man-Cho So, et al. Incremental Methods for Weakly Convex Optimization, 2019, arXiv.

[12] Dmitriy Drusvyatskiy, et al. Low-Rank Matrix Recovery with Composite Optimization: Good Conditioning and Rapid Convergence, 2019, Found. Comput. Math.

[13] Martin Jaggi, et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes, 2019, ICML.

[14] Dmitriy Drusvyatskiy, et al. Uniform Graphical Convergence of Subgradients in Nonconvex Optimization and Learning, 2018, Math. Oper. Res.

[15] Dmitriy Drusvyatskiy, et al. Stochastic model-based minimization under high-order growth, 2018, arXiv.

[16] Ohad Shamir, et al. A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates, 2018, ALT.

[17] Niao He, et al. On the Convergence Rate of Stochastic Mirror Descent for Nonsmooth Nonconvex Optimization, 2018, arXiv:1806.04781.

[18] Dmitriy Drusvyatskiy, et al. Stochastic model-based minimization of weakly convex functions, 2018, SIAM J. Optim.

[19] Wei Zhang, et al. Asynchronous Decentralized Parallel Stochastic Gradient Descent, 2017, ICML.

[20] Haihao Lu. “Relative Continuity” for Non-Lipschitz Nonsmooth Convex Optimization Using Stochastic (or Deterministic) Mirror Descent, 2017, INFORMS Journal on Optimization.

[21] Damek Davis, et al. Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems, 2017, SIAM J. Optim.

[22] Wei Zhang, et al. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent, 2017, NIPS.

[23] Feng Ruan, et al. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval, 2017, Information and Inference: A Journal of the IMA.

[24] Feng Ruan, et al. Stochastic Methods for Composite and Weakly Convex Optimization Problems, 2017, SIAM J. Optim.

[25] Nenghai Yu, et al. Asynchronous Stochastic Gradient Descent with Delay Compensation, 2016, ICML.

[26] Wotao Yin, et al. On Nonconvex Decentralized Gradient Descent, 2016, IEEE Transactions on Signal Processing.

[27] Alexander J. Smola, et al. AdaDelay: Delay Adaptive Distributed Stochastic Optimization, 2016, AISTATS.

[28] Dmitriy Drusvyatskiy, et al. Efficiency of minimizing compositions of convex functions and smooth maps, 2016, Math. Program.

[29] Ji Liu, et al. Staleness-Aware Async-SGD for Distributed Deep Learning, 2015, IJCAI.

[30] Yijun Huang, et al. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization, 2015, NIPS.

[31] Matthew J. Streeter, et al. Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning, 2014, NIPS.

[32] Hamid Reza Feyzmahdavian, et al. A delayed proximal gradient method with linear convergence rate, 2014, 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

[33] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.

[34] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[35] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.

[36] John C. Duchi, et al. Distributed delayed stochastic optimization, 2011, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[37] Stephen J. Wright, et al. A proximal method for composite minimization, 2008, Mathematical Programming.

[38] Alexander Shapiro, et al. Stochastic Approximation approach to Stochastic Programming, 2013.

[39] D. Ruppert. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2004.

[40] H. Robbins. A Stochastic Approximation Method, 1951.

[41] Sai Praneeth Karimireddy. The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Updates, 2020.

[42] Ohad Shamir, et al. Stochastic Convex Optimization, 2009, COLT.

[43] Vivek S. Borkar, et al. Distributed Asynchronous Incremental Subgradient Methods, 2001.

[44] John N. Tsitsiklis, et al. Parallel and distributed computation, 1989.

[45] R. Fletcher. A model algorithm for composite nondifferentiable optimization problems, 1982.