On Second-order Optimization Methods for Federated Learning

We consider federated learning (FL), where the training data is distributed across a large number of clients. The standard optimization method in this setting is Federated Averaging (FedAvg), which performs multiple local first-order optimization steps between communication rounds. In this work, we evaluate several second-order distributed methods with local steps that promise favorable convergence properties in the FL setting. We (i) show that, in contrast to the results of previous work, FedAvg performs surprisingly well against its second-order competitors when evaluated under fair metrics (an equal amount of local computation). Based on our numerical study, we (ii) propose a novel variant that uses second-order local information for its updates and a global line search to counteract the resulting local specificity.
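
To make the setting concrete, below is a minimal numerical sketch (not the authors' implementation) contrasting a FedAvg round with a second-order variant that takes damped local Newton steps and applies a global Armijo line search to the averaged update direction. The toy least-squares objective, client count, number of local steps, step size, and damping constant are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, n_samples, dim = 10, 50, 5

# Synthetic client data for f_i(w) = ||A_i w - b_i||^2 / (2 * n_samples).
A = [rng.normal(size=(n_samples, dim)) for _ in range(n_clients)]
b = [rng.normal(size=n_samples) for _ in range(n_clients)]

def grad(i, w):
    return A[i].T @ (A[i] @ w - b[i]) / n_samples

def hess(i):
    return A[i].T @ A[i] / n_samples

def global_loss(w):
    return float(np.mean([np.sum((A[i] @ w - b[i]) ** 2) / (2 * n_samples)
                          for i in range(n_clients)]))

def fedavg_round(w, local_steps=5, lr=0.1):
    """One FedAvg round: each client takes `local_steps` gradient steps,
    then the server averages the resulting local models."""
    local_models = []
    for i in range(n_clients):
        w_i = w.copy()
        for _ in range(local_steps):
            w_i = w_i - lr * grad(i, w_i)
        local_models.append(w_i)
    return np.mean(local_models, axis=0)

def second_order_round(w, local_steps=5, damping=1e-3):
    """Hedged sketch of the proposed idea: damped Newton steps locally,
    then a global Armijo backtracking line search on the averaged update
    direction to counteract local specificity. The details here are
    illustrative assumptions, not the authors' exact algorithm."""
    local_models = []
    for i in range(n_clients):
        w_i = w.copy()
        H_i = hess(i) + damping * np.eye(dim)
        for _ in range(local_steps):
            w_i = w_i - np.linalg.solve(H_i, grad(i, w_i))
        local_models.append(w_i)
    direction = np.mean(local_models, axis=0) - w

    # Global Armijo line search: shrink the step until sufficient decrease.
    g = np.mean([grad(i, w) for i in range(n_clients)], axis=0)
    slope = float(g @ direction)
    t, loss0 = 1.0, global_loss(w)
    while t > 1e-8 and global_loss(w + t * direction) > loss0 + 1e-4 * t * slope:
        t *= 0.5
    return w + t * direction

w = np.zeros(dim)
for _ in range(20):
    w = fedavg_round(w)
print("FedAvg loss after 20 rounds:          ", global_loss(w))

w = np.zeros(dim)
for _ in range(20):
    w = second_order_round(w)
print("Second-order + line search, 20 rounds:", global_loss(w))
```

The "fair metric" in the comparison corresponds to fixing the number of local computations per round (here, `local_steps`) rather than the number of communication rounds alone.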
