Byzantine-Tolerant Machine Learning

The growth of data, the need for scalability, and the complexity of models used in modern machine learning call for distributed implementations. Yet, as of today, distributed machine learning frameworks have largely ignored the possibility of arbitrary (i.e., Byzantine) failures. In this paper, we study the robustness to Byzantine failures at the fundamental level of stochastic gradient descent (SGD), the heart of most machine learning algorithms. Assuming a set of $n$ workers, up to $f$ of them Byzantine, we ask how robust SGD can be, without limiting either the dimension or the size of the parameter space. We first show that no gradient descent update rule based on a linear combination of the vectors proposed by the workers (i.e., current approaches) tolerates a single Byzantine failure. We then formulate a resilience property of the update rule capturing the basic requirements to guarantee convergence despite $f$ Byzantine workers. We finally propose Krum, an update rule that satisfies the aforementioned resilience property. For a $d$-dimensional learning problem, the time complexity of Krum is $O(n^2 \cdot (d + \log n))$.
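
Below is a minimal, illustrative sketch of the Krum selection step as summarized above, assuming the usual parameter-server setting in which each of the $n$ workers proposes a gradient estimate: each proposal is scored by its summed squared distance to its $n - f - 2$ closest peers, and the proposal with the smallest score is kept. The function name `krum` and its NumPy-based signature are ours, not the paper's; the naive pairwise-distance loop reflects the stated $O(n^2 \cdot (d + \log n))$ cost.

```python
import numpy as np

def krum(gradients, f):
    """Select one of the workers' proposed gradients with the Krum rule:
    score each proposal by the sum of squared distances to its n - f - 2
    closest peers, then return the minimizer.

    gradients : list of n NumPy arrays of shape (d,), one per worker
    f         : assumed upper bound on the number of Byzantine workers
    """
    n = len(gradients)
    assert n > 2 * f + 2, "Krum's guarantee assumes n > 2f + 2"

    # Pairwise squared Euclidean distances between proposals: O(n^2 * d).
    dists = np.array([[np.sum((gi - gj) ** 2) for gj in gradients]
                      for gi in gradients])

    scores = []
    for i in range(n):
        # Distances from proposal i to all others, sorted ascending: O(n log n).
        others = np.sort(np.delete(dists[i], i))
        # Krum score: sum over the n - f - 2 closest proposals.
        scores.append(others[: n - f - 2].sum())

    # Keep the proposal that lies closest to a large cluster of other proposals.
    return gradients[int(np.argmin(scores))]
```

In this sketch, the selected vector would stand in for the usual average of worker gradients in the SGD update step, which is precisely the linear-combination rule the abstract shows to be non-resilient.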
