Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

Most distributed machine learning systems today, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms is the high communication cost on the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory show no advantage over centralized PSGD (C-PSGD) algorithms, simply assuming an application scenario in which only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. The reason is that D-PSGD has total computational complexity comparable to that of C-PSGD but requires much less communication on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms with up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
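To make the computation/communication tradeoff concrete, the following is a minimal single-process sketch of a decentralized update of this flavor: each worker takes a local stochastic gradient step and averages its model only with its ring neighbors through a doubly stochastic mixing matrix, so the busiest node communicates with a constant number of peers rather than with all workers. The ring topology, quadratic toy objective, noise model, and step size are illustrative assumptions of this sketch, not the paper's experimental setup.

```python
import numpy as np

# Minimal simulation of decentralized parallel SGD (D-PSGD-style) on a ring
# of n workers. Each worker keeps a local model replica, computes a stochastic
# gradient on its own data, then averages its replica with its ring neighbors
# via a doubly stochastic mixing matrix W. The quadratic objective, ring
# topology, and hyperparameters below are illustrative assumptions only.

n, d, lr, steps = 8, 10, 0.05, 200
rng = np.random.default_rng(0)

# Each worker's "data": a local quadratic f_i(x) = 0.5 * ||x - a_i||^2,
# so the global minimizer is the mean of the a_i.
a = rng.normal(size=(n, d))
x = np.zeros((n, d))  # one model replica per worker

# Doubly stochastic mixing matrix for a ring: average with both neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 1 / 3
    W[i, (i - 1) % n] = 1 / 3
    W[i, (i + 1) % n] = 1 / 3

for t in range(steps):
    # Stochastic gradient of f_i at x_i; additive noise stands in for
    # mini-batch sampling noise.
    grad = (x - a) + 0.1 * rng.normal(size=(n, d))
    # Neighborhood averaging (communication) followed by the local SGD step.
    x = W @ x - lr * grad

# All replicas should approach the global minimizer mean(a); per-node
# communication here is O(#neighbors), not O(n) as at a central server.
print("consensus error:", np.abs(x - x.mean(axis=0)).max())
print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - a.mean(axis=0)))
```

The key design point the sketch illustrates is that communication volume per iteration is bounded by a node's degree in the network graph, which is what drives the speedups reported on low-bandwidth or high-latency networks.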
