FedDRO: Federated Compositional Optimization for Distributionally Robust Learning

Compositional optimization (CO) has recently gained popularity owing to its applications in distributionally robust optimization (DRO) and many other machine learning problems. The large-scale and distributed availability of data demands the development of efficient federated learning (FL) algorithms for solving CO problems. Developing FL algorithms for CO is particularly challenging because of the compositional nature of the objective. Moreover, current state-of-the-art methods for such problems rely on large-batch gradients (with batch sizes that depend on the target solution accuracy), which is infeasible in most practical settings. To address these challenges, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first establish that vanilla FedAvg is ill-suited to distributed CO problems: data heterogeneity in the compositional objective at each client amplifies the bias in the local compositional gradient estimates. To this end, we propose FedDRO, a novel FL framework that exploits the DRO problem structure to design a communication strategy that allows FedAvg to control the bias in the estimation of the compositional gradient. A key novelty of our work is the development of solution-accuracy-independent algorithms that do not require large-batch gradients (or function evaluations) for solving federated CO problems. We establish $\mathcal{O}(\epsilon^{-2})$ sample and $\mathcal{O}(\epsilon^{-3/2})$ communication complexities in the FL setting while achieving linear speedup with the number of clients. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.
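To make the bias issue concrete, consider a generic two-level stochastic CO formulation over $n$ clients (the notation below is a standard illustrative form and may differ from the paper's exact setup):
\[
\min_{x \in \mathbb{R}^d} \; F(x) = f\bigl(g(x)\bigr), \qquad g(x) = \frac{1}{n}\sum_{i=1}^{n} g_i(x), \qquad g_i(x) = \mathbb{E}_{\xi_i}\bigl[g_i(x;\xi_i)\bigr],
\]
with the exact gradient given by the chain rule,
\[
\nabla F(x) = \nabla g(x)^{\top}\,\nabla f\bigl(g(x)\bigr).
\]
A naive local estimator plugs a client's minibatch estimate of the inner function into the outer gradient,
\[
\widehat{\nabla F}_i(x) = \nabla g_i(x;\xi_i)^{\top}\,\nabla f\bigl(g_i(x;\xi_i)\bigr),
\]
which is biased because $f$ is nonlinear (so $\mathbb{E}\bigl[\nabla f(g_i(x;\xi_i))\bigr] \neq \nabla f\bigl(g(x)\bigr)$) and, under data heterogeneity, $g_i(x) \neq g(x)$; repeated FedAvg local steps compound this bias, which motivates a communication strategy that keeps each client's inner-function estimate close to its global value.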
