Towards Optimal Convergence Rate in Decentralized Stochastic Training

Parallel training with decentralized communication is a promising approach to scaling up machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity of such methods in the stochastic non-convex setting. This lower bound reveals a theoretical gap in the known convergence rates of many existing algorithms. To show that the bound is tight and achievable, we propose DeFacto, a class of algorithms that converge at the optimal rate without additional theoretical assumptions. We discuss the trade-offs among different algorithms in terms of complexity, memory efficiency, and throughput. Empirically, we compare DeFacto with other decentralized algorithms by training ResNet-20 on CIFAR-10 and ResNet-110 on CIFAR-100, and show that DeFacto accelerates training in wall-clock time but progresses slowly in the first few epochs.
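To make the setting concrete, the following is a minimal sketch of the generic decentralized parallel SGD template (local stochastic gradient step followed by gossip averaging with neighbors) that this line of work analyzes. It is not the paper's DeFacto algorithm, whose details are not given in the abstract; the quadratic local objectives, ring topology, mixing matrix, and hyperparameters are illustrative assumptions, and the parallel workers are simulated in a single NumPy process.

```python
# Minimal simulation of decentralized parallel SGD with gossip averaging
# over a ring topology (illustrative sketch, not the DeFacto algorithm).
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, steps, lr = 8, 10, 200, 0.05

# Each worker i holds a noisy quadratic objective f_i(x) = 0.5 * ||x - b_i||^2;
# the global objective is the average of the f_i, minimized at mean(b_i).
targets = rng.normal(size=(n_workers, dim))

# Symmetric, doubly stochastic mixing matrix for a ring: each worker averages
# its model with itself and its two neighbors.
W = np.zeros((n_workers, n_workers))
for i in range(n_workers):
    W[i, i] = 1 / 3
    W[i, (i - 1) % n_workers] = 1 / 3
    W[i, (i + 1) % n_workers] = 1 / 3

x = np.zeros((n_workers, dim))  # local models, one row per worker
for _ in range(steps):
    # Stochastic gradients of the local objectives (additive Gaussian noise).
    grads = (x - targets) + 0.1 * rng.normal(size=x.shape)
    # Local SGD step, then one round of gossip averaging with neighbors.
    x = W @ (x - lr * grads)

consensus = x.mean(axis=0)
print("distance to optimum:", np.linalg.norm(consensus - targets.mean(axis=0)))
print("consensus error:", np.linalg.norm(x - consensus))
```

Running the sketch shows the two quantities a convergence analysis of such methods must control: how fast the averaged model approaches a stationary point, and how quickly the local models reach consensus under the chosen communication topology.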
