Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top

Byzantine-robustness has been gaining increasing attention due to the growing interest in collaborative and federated learning. However, many fruitful directions, such as the use of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA, a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively, while communication compression is a bonus that makes the process more communication efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA that outperform the previous state-of-the-art for general non-convex and Polyak-Łojasiewicz loss functions. Unlike concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.
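
To make the combination of ingredients concrete, below is a minimal, illustrative sketch in Python of the general idea rather than the paper's exact algorithm: on a toy distributed least-squares problem, honest workers send MARINA-style variance-reduced messages (occasional full local gradients, otherwise compressed gradient differences via an unbiased rand-k sparsifier), Byzantine workers send arbitrary vectors, and the server aggregates with a coordinate-wise median. The problem setup, the rand-k compressor, the coordinate-wise median aggregator, and all parameter values are assumptions made for this example; they stand in for the compressors and agnostic robust aggregators analyzed in the paper.

```python
# Illustrative sketch only (not the paper's exact algorithm): variance-reduced
# messages + unbiased compression + robust aggregation under Byzantine workers.
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, n_byz = 20, 10, 2          # dimension, total workers, Byzantine workers
A = rng.normal(size=(200, d))            # toy least-squares data, split across workers
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=200)
chunks = np.array_split(np.arange(200), n_workers)

def local_grad(w, idx):
    """Full local gradient of 0.5*||A_i w - b_i||^2 / |idx| on a worker's shard."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ w - bi) / len(idx)

def rand_k(v, k):
    """Unbiased rand-k sparsification: keep k random coordinates, rescale by d/k."""
    mask = np.zeros_like(v)
    keep = rng.choice(v.size, size=k, replace=False)
    mask[keep] = v.size / k
    return v * mask

def coord_median(vectors):
    """Coordinate-wise median: a simple robust aggregator used here as a
    stand-in for the agnostic robust aggregators analyzed in the paper."""
    return np.median(np.stack(vectors), axis=0)

w = np.zeros(d)
g = coord_median([local_grad(w, c) for c in chunks])   # initial gradient estimate
lr, p, k = 0.05, 0.2, 5                                # step size, sync prob., rand-k

for t in range(300):
    w_new = w - lr * g
    sync = rng.random() < p                            # MARINA-style rare full sync
    msgs = []
    for i, idx in enumerate(chunks):
        if i < n_byz:
            msgs.append(rng.normal(scale=10.0, size=d))  # Byzantine: arbitrary vector
        elif sync:
            msgs.append(local_grad(w_new, idx))          # occasional uncompressed gradient
        else:
            # Compressed *gradient difference*: variance reduction keeps honest
            # workers' messages small and concentrated, so outliers stand out.
            msgs.append(rand_k(local_grad(w_new, idx) - local_grad(w, idx), k))
    agg = coord_median(msgs)
    g = agg if sync else g + agg                       # update the gradient estimator
    w = w_new

print("final loss:", 0.5 * np.mean((A @ w - b) ** 2))
```

The point of the sketch is the division of labor suggested by the abstract: variance reduction makes honest workers' messages concentrated so that a robust aggregator can separate them from Byzantine outliers, while compression only affects how cheaply those messages are transmitted.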
