A Stochastic Second-Order Proximal Method for Distributed Optimization

We propose a distributed stochastic second-order proximal (St-SoPro) method that enables agents in a network to cooperatively minimize the sum of their local loss functions without any centralized coordination. St-SoPro incorporates a decentralized second-order approximation into an augmented Lagrangian function and performs its updates with randomly sampled local gradients and Hessian matrices, which makes it efficient for large-scale problems. We show that, for restricted strongly convex and smooth problems, the agents converge linearly in expectation to a neighborhood of the optimum, and that this neighborhood can be made arbitrarily small under proper parameter settings. Simulations on real machine learning datasets demonstrate that St-SoPro outperforms several state-of-the-art methods in convergence speed as well as in computation and communication costs.
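
The sampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the St-SoPro update itself: it only shows how one agent might form mini-batch estimates of its local gradient and Hessian. The least-squares loss, the function name `sampled_grad_hess`, and the toy data are assumptions made for illustration and do not come from the paper.

```python
import random

def sampled_grad_hess(A, b, x, batch_size, rng):
    """Mini-batch estimates of the gradient and Hessian of the (assumed) local loss
    f(x) = (1 / (2m)) * sum_j (a_j . x - b_j)^2, using a uniform sample of rows."""
    m, d = len(A), len(x)
    batch = rng.sample(range(m), batch_size)  # uniform sampling without replacement
    grad = [0.0] * d
    hess = [[0.0] * d for _ in range(d)]
    for j in batch:
        a = A[j]
        residual = sum(a[k] * x[k] for k in range(d)) - b[j]
        for p in range(d):
            grad[p] += a[p] * residual / batch_size
            for q in range(d):
                hess[p][q] += a[p] * a[q] / batch_size
    return grad, hess

# Hypothetical local dataset held by a single agent.
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 1.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = [0.0, 0.0]
rng = random.Random(0)

g_est, H_est = sampled_grad_hess(A, b, x, 2, rng)        # stochastic estimates
g_full, H_full = sampled_grad_hess(A, b, x, len(A), rng)  # full batch: exact values
```

With `batch_size` equal to the number of samples, the estimates coincide with the exact local gradient and Hessian. Smaller batches cost less per iteration but add sampling noise, which is consistent with the abstract's statement that convergence is to a neighborhood of the optimum.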