On Second-order Optimization Methods for Federated Learning

We consider federated learning (FL), where the training data is distributed across a large number of clients. The standard optimization method in this setting is Federated Averaging (FedAvg), which performs multiple local first-order optimization steps between communication rounds. In this work, we evaluate several second-order distributed methods with local steps that promise favorable convergence properties in the FL setting. We (i) show that, in contrast to the results of previous work, FedAvg performs surprisingly well against its second-order competitors when evaluated under fair metrics (an equal amount of local computation). Based on our numerical study, we (ii) propose a novel variant that uses second-order local information for its updates and a global line search to counteract the resulting local specificity.
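
To make the setting concrete, below is a minimal numerical sketch (not the authors' implementation) contrasting a FedAvg round with a second-order variant that takes damped local Newton steps and applies a global Armijo line search to the averaged update direction. The toy least-squares objective, client count, number of local steps, step size, and damping constant are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, n_samples, dim = 10, 50, 5

# Synthetic client data for f_i(w) = ||A_i w - b_i||^2 / (2 * n_samples).
A = [rng.normal(size=(n_samples, dim)) for _ in range(n_clients)]
b = [rng.normal(size=n_samples) for _ in range(n_clients)]

def grad(i, w):
    return A[i].T @ (A[i] @ w - b[i]) / n_samples

def hess(i):
    return A[i].T @ A[i] / n_samples

def global_loss(w):
    return float(np.mean([np.sum((A[i] @ w - b[i]) ** 2) / (2 * n_samples)
                          for i in range(n_clients)]))

def fedavg_round(w, local_steps=5, lr=0.1):
    """One FedAvg round: each client takes `local_steps` gradient steps,
    then the server averages the resulting local models."""
    local_models = []
    for i in range(n_clients):
        w_i = w.copy()
        for _ in range(local_steps):
            w_i = w_i - lr * grad(i, w_i)
        local_models.append(w_i)
    return np.mean(local_models, axis=0)

def second_order_round(w, local_steps=5, damping=1e-3):
    """Hedged sketch of the proposed idea: damped Newton steps locally,
    then a global Armijo backtracking line search on the averaged update
    direction to counteract local specificity. The details here are
    illustrative assumptions, not the authors' exact algorithm."""
    local_models = []
    for i in range(n_clients):
        w_i = w.copy()
        H_i = hess(i) + damping * np.eye(dim)
        for _ in range(local_steps):
            w_i = w_i - np.linalg.solve(H_i, grad(i, w_i))
        local_models.append(w_i)
    direction = np.mean(local_models, axis=0) - w

    # Global Armijo line search: shrink the step until sufficient decrease.
    g = np.mean([grad(i, w) for i in range(n_clients)], axis=0)
    slope = float(g @ direction)
    t, loss0 = 1.0, global_loss(w)
    while t > 1e-8 and global_loss(w + t * direction) > loss0 + 1e-4 * t * slope:
        t *= 0.5
    return w + t * direction

w = np.zeros(dim)
for _ in range(20):
    w = fedavg_round(w)
print("FedAvg loss after 20 rounds:          ", global_loss(w))

w = np.zeros(dim)
for _ in range(20):
    w = second_order_round(w)
print("Second-order + line search, 20 rounds:", global_loss(w))
```

The "fair metric" in the comparison corresponds to fixing the number of local computations per round (here, `local_steps`) rather than the number of communication rounds alone.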
