Byzantine Stochastic Gradient Descent

This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines that allegedly compute stochastic gradients in every iteration, an $\alpha$-fraction are Byzantine and may behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) that finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sample complexity and time complexity.
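To make the setting concrete, below is a minimal sketch of Byzantine-tolerant distributed SGD. The abstract does not specify the paper's aggregation rule, so the coordinate-wise median used here is only a stand-in for a robust aggregation step; the names `grad_oracle`, `byzantine_sgd`, and all parameters are hypothetical and chosen for illustration.

```python
import numpy as np

def byzantine_sgd(grad_oracle, x0, m, T, lr):
    """Illustrative Byzantine-tolerant distributed SGD (not the paper's algorithm).

    grad_oracle(x, i) returns machine i's reported stochastic gradient at x;
    up to an alpha-fraction of the m machines may report arbitrary vectors.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        # Collect all m reported gradients at the current iterate.
        reports = np.stack([grad_oracle(x, i) for i in range(m)])
        # Robust aggregation: the coordinate-wise median tolerates fewer than
        # half of the reports being arbitrarily corrupted.
        g = np.median(reports, axis=0)
        # Standard SGD step with the aggregated gradient estimate.
        x = x - lr * g
    return x
```

Plain averaging of the $m$ reports, as in mini-batch SGD, can be driven arbitrarily far off course by a single Byzantine machine, which is why some robust aggregation step is needed before the descent update.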
