Impact of Redundancy on Resilience in Distributed Optimization and Learning

This paper considers the problem of resilient distributed optimization and stochastic learning in a server-based architecture. The system comprises a server and multiple agents, where each agent has its own local cost function. The agents collaborate with the server to find a minimum of the aggregate of the local cost functions. In the context of stochastic learning, the local cost of an agent is the loss function computed over the data at that agent. We consider this problem in a system wherein some of the agents may be Byzantine faulty and some may be slow (also called stragglers), and we investigate the conditions under which an “approximate” solution to the above problem can be obtained. In particular, we introduce the notion of (f, r; ϵ)-resilience to characterize how well the true solution is approximated in the presence of up to f Byzantine faulty agents and up to r slow agents (stragglers); a smaller ϵ represents a better approximation. We also introduce a measure named (f, r; ϵ)-redundancy to characterize the redundancy in the cost functions of the agents; greater redundancy allows for a better approximation when minimizing the aggregate cost. We constructively show, both theoretically and empirically, that (f, r; ϵ)-resilience can indeed be achieved in practice, provided that the local cost functions are sufficiently redundant. Our empirical evaluation considers a distributed gradient descent (DGD)-based solution; for distributed learning in the presence of Byzantine and asynchronous agents, we also evaluate a distributed stochastic gradient descent (D-SGD)-based algorithm.
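
To make the server-based setting concrete, the sketch below shows one way a server-side update could combine straggler tolerance (using only the first n − r gradients to arrive) with a norm-based Byzantine filter in a D-SGD-style loop. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names (cge_aggregate, server_step) and the CGE-style norm filter are illustrative choices.

```python
import numpy as np

def cge_aggregate(gradients, f):
    """Norm-based filter in the spirit of comparative gradient elimination (CGE):
    discard the f gradients with the largest norms and average the rest.
    Illustrative only; the paper's exact aggregation rule may differ."""
    by_norm = sorted(gradients, key=np.linalg.norm)
    return np.mean(by_norm[:len(by_norm) - f], axis=0)

def server_step(x, received_grads, f, r, step_size):
    """One hypothetical D-SGD server update: use only the first n - r gradients
    to arrive (tolerating up to r stragglers), then apply a Byzantine-robust
    filter that tolerates up to f faulty gradients among them."""
    n = len(received_grads)
    fastest = received_grads[: n - r]   # assumes the list is ordered by arrival time
    g = cge_aggregate(fastest, f)
    return x - step_size * g
```

A usage pattern consistent with the abstract would have each non-faulty agent compute a stochastic gradient of its local loss at the current iterate x, the server call server_step as soon as n − r responses arrive, and the loop repeat until convergence.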
