BRIDGE: Byzantine-Resilient Decentralized Gradient Descent

Machine learning now plays a central role in many applications. Many of these applications involve datasets that are distributed across multiple computing devices or machines, whether because of design constraints or for computational and privacy reasons. Such applications often require the learning task to be carried out in a decentralized fashion, in which there is no central server directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures caused by malfunctioning equipment, cyberattacks, and the like, which are likely to crash non-robust learning algorithms. The focus of this paper is on robustifying decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to deviate arbitrarily from their intended behavior, so algorithms designed against it are resilient to the broadest class of faults. The study of Byzantine resilience in decentralized learning, in contrast to distributed learning, is still in its infancy; in particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large machine learning models, or they lack statistical convergence guarantees that help characterize their generalization errors. This paper introduces a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE). Algorithmic and statistical convergence guarantees are provided for both strongly convex problems and a class of nonconvex problems. In addition, large-scale decentralized learning experiments establish that the BRIDGE framework is scalable and delivers competitive results for Byzantine-resilient convex and nonconvex learning.
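
At a high level, each node in a BRIDGE-style method screens the parameter vectors received from its graph neighbors before taking a local gradient step, so that a bounded number of Byzantine neighbors cannot hijack the update. The Python sketch below illustrates this idea for one possible screening rule, a coordinate-wise trimmed mean; the function names, the choice of trimmed-mean screening, and the way the node's own iterate enters the aggregate are illustrative assumptions rather than the paper's exact specification.

    import numpy as np

    def trimmed_mean(stacked_params, b):
        # Coordinate-wise trimmed mean: in every coordinate, drop the b largest
        # and b smallest values and average the remainder.
        # stacked_params: (num_vectors, d) array of parameter vectors.
        # b: assumed upper bound on the number of Byzantine neighbors.
        assert stacked_params.shape[0] > 2 * b, "need more than 2b vectors to screen"
        sorted_vals = np.sort(stacked_params, axis=0)            # sort each coordinate separately
        kept = sorted_vals[b : stacked_params.shape[0] - b]      # screen out the extremes
        return kept.mean(axis=0)

    def bridge_style_step(w_local, neighbor_params, local_gradient, step_size, b):
        # One screened decentralized gradient-descent iteration at a single node:
        # aggregate the screened neighborhood, then descend along the local gradient.
        stacked = np.vstack([w_local[None, :], neighbor_params])  # include own iterate (an assumption)
        screened = trimmed_mean(stacked, b)
        return screened - step_size * local_gradient

In a full decentralized run, every honest node would apply such a step in parallel at each iteration, exchanging iterates only with its neighbors on the communication graph; a diminishing step size is typically used to obtain convergence guarantees of the kind discussed above.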
