Data Encoding for Byzantine-Resilient Distributed Optimization

We study distributed optimization in the presence of Byzantine adversaries, where both data and computation are distributed among $m$ worker machines, $t$ of which may be corrupt. The compromised nodes may collaboratively and arbitrarily deviate from their pre-specified programs, while a designated (master) node iteratively computes the model/parameter vector for generalized linear models. In this work, we primarily focus on two iterative algorithms: Proximal Gradient Descent (PGD) and Coordinate Descent (CD); gradient descent (GD) is a special case of both. PGD is typically used in the data-parallel setting, where data is partitioned across samples, whereas CD is used in the model-parallel setting, where data is partitioned across the parameter space. At the core of our solutions for both algorithms is a method for Byzantine-resilient matrix-vector (MV) multiplication; for that, we propose a method based on data encoding and error correction over the real numbers to combat adversarial attacks. We can tolerate up to $t \leq \lfloor \frac{m-1}{2} \rfloor$ corrupt worker nodes, which is information-theoretically optimal. We give deterministic guarantees, and our method does not assume any probability distribution on the data. We develop a sparse encoding scheme which enables computationally efficient data encoding and decoding, and we demonstrate a trade-off between the corruption threshold and the resource requirements (storage, computational, and communication complexity). As an example, for $t \leq \frac{m}{3}$, our scheme incurs only a constant overhead on these resources over that required by the plain distributed PGD/CD algorithms, which provide no adversarial protection. To the best of our knowledge, ours is the first paper that connects MV multiplication with CD and designs a specific encoding matrix for MV multiplication whose structure can be leveraged to make CD secure against adversarial attacks. Our encoding scheme extends efficiently to (i) the data streaming model, in which data samples arrive in an online fashion and are encoded as they arrive, and (ii) making stochastic gradient descent (SGD) Byzantine-resilient. Finally, we give experimental results to show the efficacy of our proposed schemes.
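
To make the core redundancy-plus-robust-decoding idea concrete, the following is a minimal sketch of Byzantine-resilient matrix-vector multiplication. It is not the paper's sparse real-number encoding scheme: it uses plain replication of row blocks across groups of $2t+1$ workers and entry-wise median decoding at the master, which achieves the same resilience guarantee at a much higher redundancy cost. All names here (encode_replicate, worker_compute, master_decode) are illustrative placeholders, and the corruption model is a simple random perturbation chosen only for the demo.

```python
import numpy as np

def encode_replicate(A, m, t):
    """Split the rows of A into groups and replicate each group on 2t+1 workers.
    Plain replication stands in for the paper's sparse encoding; any leftover
    workers (when m is not divisible by 2t+1) are simply left idle."""
    r = 2 * t + 1                       # replicas per group (majority decodable)
    g = m // r                          # number of row groups
    blocks = np.array_split(A, g, axis=0)
    return [(i % g, blocks[i % g]) for i in range(g * r)]

def worker_compute(block, x, byzantine=False, rng=None):
    """Each worker multiplies its assigned block by x; a Byzantine worker
    may return an arbitrary vector instead."""
    y = block @ x
    if byzantine:
        y = rng.normal(scale=100.0, size=y.shape)   # arbitrary corruption
    return y

def master_decode(assignments, replies, g):
    """Recover A @ x via an entry-wise median over each group's replicas.
    With at most t corrupt workers in total, every group of 2t+1 replicas has
    an honest majority, so the median equals the true product."""
    groups = [[] for _ in range(g)]
    for (gid, _), y in zip(assignments, replies):
        groups[gid].append(y)
    return np.concatenate([np.median(np.stack(ys), axis=0) for ys in groups])

# Toy run: m = 15 workers, tolerating t = 2 Byzantine workers.
rng = np.random.default_rng(0)
n, d, m, t = 30, 8, 15, 2
A, x = rng.normal(size=(n, d)), rng.normal(size=d)

assignments = encode_replicate(A, m, t)
bad = set(rng.choice(len(assignments), size=t, replace=False))
replies = [worker_compute(blk, x, byzantine=(i in bad), rng=rng)
           for i, (gid, blk) in enumerate(assignments)]
Ax_hat = master_decode(assignments, replies, g=m // (2 * t + 1))
assert np.allclose(Ax_hat, A @ x)
```

The sketch reflects only the structure of the protocol (encode once, have workers multiply their encoded shares, decode robustly at the master); the paper replaces replication with a sparse encoding matrix and median voting with error correction over the reals to bring the storage, computation, and communication overhead down to a constant factor for $t \leq \frac{m}{3}$.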
