Straggler-Resilient Distributed Machine Learning with Dynamic Backup Workers

With the increasing demand for large-scale training of machine learning models, consensus-based distributed optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by waiting for and averaging the estimates obtained from its neighbors, and then correcting it on the basis of its local dataset. However, the synchronization phase can be time-consuming due to the need to wait for stragglers, i.e., slower workers. An efficient way to mitigate this effect is to let each worker wait only for updates from its fastest neighbors before updating its local parameter; the remaining neighbors are called backup workers. To minimize the overall training time across the network, we propose a fully distributed algorithm that dynamically determines the number of backup workers for each worker. We show that our algorithm achieves a linear speedup for convergence (i.e., the convergence rate improves linearly with the number of workers). We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results.
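
The per-worker iteration described above can be summarized in code. The sketch below is a minimal illustration of one consensus-SGD step with backup workers, not the authors' implementation: recv_from_neighbors (returning neighbor estimates as they arrive) and local_gradient (a stochastic gradient on the worker's local data) are hypothetical helpers, and the number of backup workers b is taken as given rather than tuned dynamically.

```python
import numpy as np

def worker_step(x, neighbors, b, lr, recv_from_neighbors, local_gradient):
    """One consensus-SGD iteration with b backup workers (illustrative sketch).

    x          : current local parameter estimate (np.ndarray)
    neighbors  : list of neighbor ids
    b          : number of backup workers, i.e. slowest neighbors NOT waited for
    lr         : learning rate
    """
    k = len(neighbors) - b                           # wait only for the k fastest neighbors
    estimates = recv_from_neighbors(neighbors, k)    # blocks until k neighbor estimates arrive
    # Consensus step: average the received estimates together with the local one.
    x_avg = np.mean(np.vstack([x] + list(estimates)), axis=0)
    # Local correction step: stochastic gradient computed on the local dataset.
    return x_avg - lr * local_gradient(x_avg)
```

Choosing b separately for each worker, trading shorter waiting times against the larger consensus error of averaging fewer estimates, is what the proposed algorithm does in a fully distributed way.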
