Data Encoding for Byzantine-Resilient Distributed Optimization

We study distributed optimization in the presence of Byzantine adversaries, where both data and computation are distributed among $m$ worker machines, $t$ of which may be corrupt. The compromised nodes may collaboratively and arbitrarily deviate from their pre-specified programs, while a designated (master) node iteratively computes the model/parameter vector for generalized linear models. In this work, we primarily focus on two iterative algorithms: Proximal Gradient Descent (PGD) and Coordinate Descent (CD); gradient descent (GD) is a special case of both. PGD is typically used in the data-parallel setting, where data is partitioned across samples, whereas CD is used in the model-parallel setting, where data is partitioned across the parameter space. At the core of our solutions for both algorithms is a method for Byzantine-resilient matrix-vector (MV) multiplication; for that, we propose a method based on data encoding and error correction over the real numbers to combat adversarial attacks. We can tolerate up to $t \leq \lfloor \frac{m-1}{2} \rfloor$ corrupt worker nodes, which is information-theoretically optimal. We give deterministic guarantees, and our method does not assume any probability distribution on the data. We develop a sparse encoding scheme which enables computationally efficient data encoding and decoding, and we demonstrate a trade-off between the corruption threshold and the resource requirements (storage, computational, and communication complexity). As an example, for $t \leq \frac{m}{3}$, our scheme incurs only a constant overhead on these resources over that required by the plain distributed PGD/CD algorithms, which provide no adversarial protection. To the best of our knowledge, ours is the first paper that connects MV multiplication with CD and designs a specific encoding matrix for MV multiplication whose structure can be leveraged to make CD secure against adversarial attacks. Our encoding scheme extends efficiently to (i) the data streaming model, in which data samples arrive in an online fashion and are encoded as they arrive, and (ii) making stochastic gradient descent (SGD) Byzantine-resilient. Finally, we give experimental results to show the efficacy of our proposed schemes.
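
To make the core redundancy-plus-robust-decoding idea concrete, the following is a minimal sketch of Byzantine-resilient matrix-vector multiplication. It is not the paper's sparse real-number encoding scheme: it uses plain replication of row blocks across groups of $2t+1$ workers and entry-wise median decoding at the master, which achieves the same resilience guarantee at a much higher redundancy cost. All names here (encode_replicate, worker_compute, master_decode) are illustrative placeholders, and the corruption model is a simple random perturbation chosen only for the demo.

```python
import numpy as np

def encode_replicate(A, m, t):
    """Split the rows of A into groups and replicate each group on 2t+1 workers.
    Plain replication stands in for the paper's sparse encoding; any leftover
    workers (when m is not divisible by 2t+1) are simply left idle."""
    r = 2 * t + 1                       # replicas per group (majority decodable)
    g = m // r                          # number of row groups
    blocks = np.array_split(A, g, axis=0)
    return [(i % g, blocks[i % g]) for i in range(g * r)]

def worker_compute(block, x, byzantine=False, rng=None):
    """Each worker multiplies its assigned block by x; a Byzantine worker
    may return an arbitrary vector instead."""
    y = block @ x
    if byzantine:
        y = rng.normal(scale=100.0, size=y.shape)   # arbitrary corruption
    return y

def master_decode(assignments, replies, g):
    """Recover A @ x via an entry-wise median over each group's replicas.
    With at most t corrupt workers in total, every group of 2t+1 replicas has
    an honest majority, so the median equals the true product."""
    groups = [[] for _ in range(g)]
    for (gid, _), y in zip(assignments, replies):
        groups[gid].append(y)
    return np.concatenate([np.median(np.stack(ys), axis=0) for ys in groups])

# Toy run: m = 15 workers, tolerating t = 2 Byzantine workers.
rng = np.random.default_rng(0)
n, d, m, t = 30, 8, 15, 2
A, x = rng.normal(size=(n, d)), rng.normal(size=d)

assignments = encode_replicate(A, m, t)
bad = set(rng.choice(len(assignments), size=t, replace=False))
replies = [worker_compute(blk, x, byzantine=(i in bad), rng=rng)
           for i, (gid, blk) in enumerate(assignments)]
Ax_hat = master_decode(assignments, replies, g=m // (2 * t + 1))
assert np.allclose(Ax_hat, A @ x)
```

The sketch reflects only the structure of the protocol (encode once, have workers multiply their encoded shares, decode robustly at the master); the paper replaces replication with a sparse encoding matrix and median voting with error correction over the reals to bring the storage, computation, and communication overhead down to a constant factor for $t \leq \frac{m}{3}$.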
