Towards Optimal Convergence Rate in Decentralized Stochastic Training

Parallel training with decentralized communication is a promising approach to scaling up machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity of such methods in the stochastic non-convex setting. This lower bound reveals a theoretical gap in the known convergence rates of many existing algorithms. To show that the bound is tight and achievable, we propose DeFacto, a class of algorithms that converge at the optimal rate without additional theoretical assumptions. We discuss the trade-offs among different algorithms in terms of complexity, memory efficiency, and throughput. Empirically, we compare DeFacto with other decentralized algorithms by training ResNet-20 on CIFAR-10 and ResNet-110 on CIFAR-100, and show that DeFacto accelerates training in wall-clock time but progresses slowly in the first few epochs.
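To make the setting concrete, the following is a minimal sketch of the generic decentralized parallel SGD template (local stochastic gradient step followed by gossip averaging with neighbors) that this line of work analyzes. It is not the paper's DeFacto algorithm, whose details are not given in the abstract; the quadratic local objectives, ring topology, mixing matrix, and hyperparameters are illustrative assumptions, and the parallel workers are simulated in a single NumPy process.

```python
# Minimal simulation of decentralized parallel SGD with gossip averaging
# over a ring topology (illustrative sketch, not the DeFacto algorithm).
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, steps, lr = 8, 10, 200, 0.05

# Each worker i holds a noisy quadratic objective f_i(x) = 0.5 * ||x - b_i||^2;
# the global objective is the average of the f_i, minimized at mean(b_i).
targets = rng.normal(size=(n_workers, dim))

# Symmetric, doubly stochastic mixing matrix for a ring: each worker averages
# its model with itself and its two neighbors.
W = np.zeros((n_workers, n_workers))
for i in range(n_workers):
    W[i, i] = 1 / 3
    W[i, (i - 1) % n_workers] = 1 / 3
    W[i, (i + 1) % n_workers] = 1 / 3

x = np.zeros((n_workers, dim))  # local models, one row per worker
for _ in range(steps):
    # Stochastic gradients of the local objectives (additive Gaussian noise).
    grads = (x - targets) + 0.1 * rng.normal(size=x.shape)
    # Local SGD step, then one round of gossip averaging with neighbors.
    x = W @ (x - lr * grads)

consensus = x.mean(axis=0)
print("distance to optimum:", np.linalg.norm(consensus - targets.mean(axis=0)))
print("consensus error:", np.linalg.norm(x - consensus))
```

Running the sketch shows the two quantities a convergence analysis of such methods must control: how fast the averaged model approaches a stationary point, and how quickly the local models reach consensus under the chosen communication topology.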
