Fixing by Mixing: A Recipe for Optimal Byzantine ML under Heterogeneity

Byzantine machine learning (ML) aims to ensure the resilience of distributed learning algorithms to misbehaving (or Byzantine) machines. Although this problem has received significant attention, prior works often assume that the data held by the machines is homogeneous, which is seldom true in practical settings. Data heterogeneity makes Byzantine ML considerably more challenging, since a Byzantine machine can hardly be distinguished from a non-Byzantine outlier. A few solutions have been proposed to tackle this issue, but they provide suboptimal probabilistic guarantees and fare poorly in practice. This paper closes the theoretical gap, achieving optimality and inducing good empirical results. Specifically, we show how to automatically adapt existing solutions for (homogeneous) Byzantine ML to the heterogeneous setting through a powerful mechanism we call nearest neighbor mixing (NNM), which boosts any standard robust distributed gradient descent variant to yield optimal Byzantine resilience under heterogeneity. We obtain similar guarantees (in expectation) by plugging NNM into the distributed stochastic heavy ball method, a practical substitute for distributed gradient descent. Our empirical results significantly outperform state-of-the-art Byzantine ML solutions.
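
To make the NNM mechanism concrete, below is a minimal NumPy sketch of how a nearest-neighbor-mixing pre-aggregation step could be composed with a standard robust aggregator (coordinate-wise median here). It assumes NNM replaces each worker's gradient with the average of its n − f nearest neighbors in Euclidean distance (itself included), where f upper-bounds the number of Byzantine workers; the function names and the choice of aggregator are illustrative, not the paper's reference implementation.

```python
import numpy as np

def nearest_neighbor_mixing(gradients, f):
    """Pre-aggregation step (sketch): replace each gradient by the average
    of its n - f nearest neighbors (itself included).

    gradients: array of shape (n, d), one gradient per worker.
    f: assumed upper bound on the number of Byzantine workers.
    """
    gradients = np.asarray(gradients)
    n = len(gradients)
    mixed = np.empty_like(gradients)
    for i in range(n):
        # Euclidean distances from gradient i to every gradient (including itself).
        dists = np.linalg.norm(gradients - gradients[i], axis=1)
        # Indices of the n - f closest gradients.
        neighbors = np.argsort(dists)[: n - f]
        mixed[i] = gradients[neighbors].mean(axis=0)
    return mixed

def coordinate_wise_median(gradients):
    """A standard robust aggregation rule, used here only for illustration."""
    return np.median(gradients, axis=0)

# Usage: mix first, then aggregate with any robust rule.
rng = np.random.default_rng(0)
grads = rng.normal(size=(10, 5))   # 10 workers, 5-dimensional gradients
robust_update = coordinate_wise_median(nearest_neighbor_mixing(grads, f=2))
```

In this sketch, the mixing step is agnostic to the aggregator that follows it, so any robust rule (e.g., trimmed mean or a median variant) could be substituted for the coordinate-wise median without changing the pre-aggregation code.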
