Distributed Learning over Unreliable Networks

Most of today's distributed machine learning systems assume {\em reliable networks}: whenever two machines exchange information (e.g., gradients or models), the network is expected to guarantee delivery of the message. At the same time, recent work has demonstrated the impressive tolerance of machine learning algorithms to errors or noise arising from relaxed communication or synchronization. In this paper, we connect these two trends and consider the following question: {\em Can we design machine learning systems that are tolerant to network unreliability during training?} With this motivation, we focus on a theoretical problem of independent interest: given a standard distributed parameter-server architecture, if every communication between a worker and a server has a non-zero probability $p$ of being dropped, does there exist an algorithm that still converges, and at what speed? The technical contribution of this paper is a novel theoretical analysis proving that distributed learning over an unreliable network achieves a convergence rate comparable to centralized learning or distributed learning over reliable networks. Furthermore, we prove that the influence of the packet drop rate diminishes as the number of parameter servers grows. We map this theoretical result onto a real-world scenario, training deep neural networks over an unreliable network layer, and conduct network simulations to validate the system-level improvements gained by allowing the network to be unreliable.
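To make the setting concrete, below is a minimal simulation sketch of this setup; it is our own illustration under stated assumptions, not the paper's algorithm or analysis. It runs data-parallel SGD with the model partitioned across several parameter-server shards, and drops every per-shard push and pull independently with probability $p$; on a failed pull a worker reuses its stale copy of that shard, and a shard that receives no gradients in a round keeps its current value. All names (`n_shards`, `p_drop`, the toy least-squares objective) are hypothetical choices for illustration.

```python
import numpy as np

# Sketch (illustrative only): SGD with a sharded parameter server and
# independent message drops with probability p_drop on every push/pull.

rng = np.random.default_rng(0)
d, n_workers, n_shards, p_drop, lr, rounds = 12, 8, 4, 0.3, 0.1, 300

w_true = rng.normal(size=d)                       # ground-truth linear model
shards = np.array_split(np.arange(d), n_shards)   # index block held by each server
server_w = np.zeros(d)
worker_w = [server_w.copy() for _ in range(n_workers)]  # possibly stale local copies

def stochastic_grad(w):
    """Gradient of 0.5 * (x^T w - x^T w_true)^2 on one random sample."""
    x = rng.normal(size=d)
    return (x @ w - x @ w_true) * x

for t in range(rounds):
    received = [[] for _ in range(n_shards)]      # gradients that reach each shard
    for k in range(n_workers):
        for idx in shards:
            if rng.random() > p_drop:             # pull of this shard succeeds
                worker_w[k][idx] = server_w[idx]
        g = stochastic_grad(worker_w[k])
        for s, idx in enumerate(shards):
            if rng.random() > p_drop:             # push to this shard succeeds
                received[s].append(g[idx])
    for s, idx in enumerate(shards):
        if received[s]:                           # average whatever arrived this round
            server_w[idx] -= lr * np.mean(received[s], axis=0)

print("final distance to optimum:", np.linalg.norm(server_w - w_true))
```

Splitting the model across shards mirrors the intuition behind the theoretical result: each dropped message affects only one block of the model, so a larger number of parameter servers limits how much of the model any single drop can stall.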
