Utilizing Redundancy in Cost Functions for Resilience in Distributed Optimization and Learning

This paper considers the problem of resilient distributed optimization and stochastic machine learning in a server-based architecture. The system comprises a server and multiple agents, where each agent has a local cost function. The agents collaborate with the server to find a minimum of their aggregate cost function. We consider the case when some of the agents may be asynchronous and/or Byzantine faulty. In this case, the classical distributed gradient descent (DGD) algorithm is rendered ineffective. Our goal is to design techniques that improve the efficacy of DGD in the presence of asynchrony and Byzantine failures. To do so, we start by proposing a way to model the agents' cost functions through the generic notion of (f, r; ε)-redundancy, where f and r are the parameters of Byzantine failure and asynchrony, respectively, and ε characterizes the closeness between the agents' cost functions. This notion allows us to quantify, for any given distributed optimization problem, the level of redundancy present among the agents' cost functions. We demonstrate, both theoretically and empirically, the merits of our proposed redundancy model in improving the robustness of DGD against asynchronous and Byzantine agents, and of its extensions to distributed stochastic gradient descent (D-SGD) for robust distributed machine learning with asynchronous and Byzantine agents. This report supersedes our previous report [36], as it contains most of the results therein.

Georgetown University. Email: sl1539@georgetown.edu.
École Polytechnique Fédérale de Lausanne (EPFL). Email: nirupam.gupta@epfl.ch.
Georgetown University. Email: nitin.vaidya@georgetown.edu.

arXiv:2110.10858v1 [cs.DC] 21 Oct 2021
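
To make the abstract's central notion concrete, the following is only a sketch of how (f, r; ε)-redundancy can be formalized; it is stated as an illustration consistent with the (2f, ε)-redundancy condition of [50], and the precise definition should be taken from the body of the paper. Let Q_1, ..., Q_n denote the n agents' cost functions, and let dist(·, ·) denote a distance between solution sets (e.g., the Hausdorff distance). The cost functions satisfy (f, r; ε)-redundancy if, for every pair of subsets S, Ŝ ⊆ {1, ..., n} with Ŝ ⊆ S, |S| = n − f and |Ŝ| ≥ n − r − f,

    dist( argmin_x Σ_{i ∈ Ŝ} Q_i(x),  argmin_x Σ_{i ∈ S} Q_i(x) ) ≤ ε.

Under this reading, setting r = f and ε = 0 would recover a 2f-redundancy-type condition for exact fault-tolerance (as in [13]), while setting f = 0 isolates r as the number of asynchronous (straggling) agents whose contributions may be missing in any given iteration.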

[1] Nitin H. Vaidya, et al. Asynchronous Distributed Optimization with Redundancy in Cost Functions, 2021, ArXiv.

[2] Shuo Liu, et al. A Survey on Fault-tolerance in Distributed Optimization and Machine Learning, 2021, ArXiv.

[3] Wotao Yin, et al. More Iterations per Second, Same Quality - Why Asynchronous Algorithms may Drastically Outperform Traditional Ones, 2017, ArXiv.

[4] Stephen P. Boyd, et al. Distributed optimization for cooperative agents: application to formation flight, 2004, 43rd IEEE Conference on Decision and Control (CDC).

[5] Mark W. Schmidt, et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets, 2012, NIPS.

[6] John C. Duchi, et al. Distributed delayed stochastic optimization, 2011, 2012 51st IEEE Conference on Decision and Control (CDC).

[7] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[8] Suhas N. Diggavi, et al. Straggler Mitigation in Distributed Optimization Through Data Encoding, 2017, NIPS.

[9] Indranil Gupta, et al. Zeno: Byzantine-suspicious stochastic gradient descent, 2018, ArXiv.

[10] Shai Shalev-Shwartz, et al. Accelerated Mini-Batch Stochastic Dual Coordinate Ascent, 2013, NIPS.

[11] Alexander J. Smola, et al. Communication Efficient Distributed Machine Learning with the Parameter Server, 2014, NIPS.

[12] Asuman E. Ozdaglar, et al. Distributed Subgradient Methods for Multi-Agent Optimization, 2009, IEEE Transactions on Automatic Control.

[13] Thinh T. Doan, et al. Byzantine Fault-Tolerance in Federated Local SGD under 2f-Redundancy, 2021, IEEE Transactions on Control of Network Systems.

[14] Damiano Varagnolo, et al. Newton-Raphson Consensus for Distributed Convex Optimization, 2015, IEEE Transactions on Automatic Control.

[15] Martin Jaggi, et al. Learning from History for Byzantine Robust Optimization, 2020, ICML.

[16] Léon Bottou. Online Learning and Stochastic Approximations, 1998.

[17] Soummya Kar, et al. Coded Distributed Computing for Inverse Problems, 2017, NIPS.

[18] Rémi Leblond, et al. Asynchronous optimization for machine learning, 2018.

[19] Jugal K. Kalita, et al. A Survey of the Usages of Deep Learning for Natural Language Processing, 2018, IEEE Transactions on Neural Networks and Learning Systems.

[20] Jakub Konecný, et al. Federated Optimization: Distributed Optimization Beyond the Datacenter, 2015, ArXiv.

[21] Nitin H. Vaidya, et al. Fault-Tolerant Multi-Agent Optimization: Optimal Iterative Distributed Algorithms, 2016, PODC.

[22] Nihar B. Shah, et al. When do redundant requests reduce latency?, 2013, 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[23] Hamid Reza Feyzmahdavian, et al. Advances in Asynchronous Parallel and Distributed Optimization, 2020, Proceedings of the IEEE.

[24] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.

[25] Léon Bottou, et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.

[26] Kannan Ramchandran, et al. Speeding Up Distributed Machine Learning Using Codes, 2015, IEEE Transactions on Information Theory.

[27] Rachid Guerraoui, et al. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent, 2017, NIPS.

[28] Nitin H. Vaidya, et al. Byzantine Fault-Tolerant Distributed Machine Learning with Norm-Based Comparative Gradient Elimination, 2021, 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).

[29] Nitin H. Vaidya, et al. Resilience in Collaborative Optimization: Redundant and Independent Cost Functions, 2020, ArXiv.

[30] Zeyuan Allen-Zhu, et al. Optimal Black-Box Reductions Between Optimization Objectives, 2016, NIPS.

[31] Mor Harchol-Balter, et al. Reducing Latency via Redundant Requests: Exact Analysis, 2015, SIGMETRICS.

[32] Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[33] Lili Su, et al. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent, 2019, PERV.

[34] Gregory W. Wornell, et al. Using Straggler Replication to Reduce Latency in Large-scale Parallel Computing, 2015, PERV.

[35] Francis Bach, et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives, 2014, NIPS.

[36] Rachid Guerraoui, et al. AGGREGATHOR: Byzantine Machine Learning via Robust Gradient Aggregation, 2019, SysML.

[37] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Review.

[38] Lisandro Dalcin, et al. Parallel distributed computing using Python, 2011.

[39] Stephen P. Boyd, et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, 2011, Foundations and Trends in Machine Learning.

[40] Suhas N. Diggavi, et al. Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning, 2018, Journal of Machine Learning Research.

[41] Scott Shenker, et al. Effective Straggler Mitigation: Attack of the Clones, 2013, 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13).

[42] Leslie Lamport, et al. The Byzantine Generals Problem, 1982, ACM Transactions on Programming Languages and Systems (TOPLAS).

[43] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proceedings of the IEEE.

[44] Dimitris S. Papailiopoulos, et al. DRACO: Byzantine-resilient Distributed Training via Redundant Gradients, 2018, ICML.

[45] Kannan Ramchandran, et al. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates, 2018, ICML.

[46] John Langford, et al. Slow Learners are Fast, 2009, NIPS.

[47] Babak Hassibi, et al. Improving Distributed Gradient Descent Using Reed-Solomon Codes, 2017, 2018 IEEE International Symposium on Information Theory (ISIT).

[48] Hamid Reza Feyzmahdavian, et al. An asynchronous mini-batch algorithm for regularized stochastic optimization, 2015, 54th IEEE Conference on Decision and Control (CDC).

[49] Alexandros G. Dimakis, et al. Gradient Coding: Avoiding Stragglers in Distributed Learning, 2017, ICML.

[50] Nitin H. Vaidya, et al. Approximate Byzantine Fault-Tolerance in Distributed Optimization, 2021, PODC.

[51] Roland Vollgraf, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017, ArXiv.

[52] Suhas N. Diggavi, et al. Encoded distributed optimization, 2017, IEEE International Symposium on Information Theory (ISIT).

[53] Robert Nowak, et al. Distributed optimization in sensor networks, 2004, Third International Symposium on Information Processing in Sensor Networks (IPSN).

[54] Randy H. Katz, et al. Multi-Task Learning for Straggler Avoiding Predictive Job Scheduling, 2016, Journal of Machine Learning Research.

[55] Qing Ling, et al. EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization, 2014, ArXiv:1404.6264.