L-DQN: An Asynchronous Limited-Memory Distributed Quasi-Newton Method

This work proposes a distributed algorithm, called L-DQN, for solving empirical risk minimization problems under the master/worker communication model. L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. Our method is efficient in both storage and communication costs: in every iteration, the master node and the workers exchange vectors of size O(d), where d is the dimension of the decision variable, and the memory required on each node is O(md), where m is an adjustable memory parameter. To our knowledge, this is the first distributed quasi-Newton method with provable global linear convergence guarantees in the asynchronous setting, where delays between nodes are present. Numerical experiments illustrate the theory and the practical performance of our method.
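To make the O(md) memory and O(d) per-step costs concrete, below is a minimal sketch of the standard L-BFGS two-loop recursion, the limited-memory quasi-Newton building block that methods of this kind rely on. It is written for the serial setting and is not the paper's actual asynchronous L-DQN update; the function name and the NumPy-based interface are assumptions for illustration only. With at most m stored curvature pairs (s_i, y_i), each of dimension d, the storage is O(md) and each direction computation touches only vectors of size d.

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list):
    """Illustrative (not the paper's) L-BFGS two-loop recursion.

    Approximates H_k @ grad using the m most recently stored curvature
    pairs (s_i, y_i), so memory stays O(m*d) and each call costs O(m*d).
    """
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: run from the newest stored pair to the oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    # Scale by an initial Hessian estimate gamma * I (standard choice).
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: run from the oldest stored pair to the newest.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r  # approximates H_k @ grad; -r is the quasi-Newton direction
```

In a limited-memory scheme, the caller keeps only the last m pairs (for example in a fixed-length deque), which is what makes m the adjustable memory parameter referred to above.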
