[1] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.
[2] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.
[3] Sonia Martínez, et al. Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication, 2014, Automatica.
[4] Ohad Shamir, et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization, 2011, ICML.
[5] Suhas Diggavi, et al. Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations, 2019, IEEE Journal on Selected Areas in Information Theory.
[6] Karl Henrik Johansson, et al. Distributed Event-Triggered Control for Multi-Agent Systems, 2012, IEEE Transactions on Automatic Control.
[7] Hanlin Tang, et al. Communication Compression for Decentralized Training, 2018, NeurIPS.
[8] Dan Alistarh, et al. The Convergence of Sparsified Gradient Methods, 2018, NeurIPS.
[9] Ananda Theertha Suresh, et al. Distributed Mean Estimation with Limited Communication, 2016, ICML.
[10] Nikko Strom, et al. Scalable distributed DNN training using commodity GPU cloud computing, 2015, INTERSPEECH.
[11] Martin Jaggi, et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes, 2019, ICML.
[12] Gregory F. Coppola. Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing, 2015.
[13] Wei Shi, et al. Expander graph and communication-efficient decentralized optimization, 2016, 50th Asilomar Conference on Signals, Systems and Computers.
[14] Rong Jin, et al. On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization, 2019, ICML.
[15] Paulo Tabuada, et al. An introduction to event-triggered and self-triggered control, 2012, 51st IEEE Conference on Decision and Control (CDC).
[16] Martin Jaggi, et al. Sparsified SGD with Memory, 2018, NeurIPS.
[17] Karl Henrik Johansson, et al. Event-based broadcasting for multi-agent average consensus, 2013, Automatica.
[18] Martin Jaggi, et al. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication, 2019, ICML.
[19] Behrouz Touri, et al. Non-Convex Distributed Optimization, 2015, IEEE Transactions on Automatic Control.
[20] Michael G. Rabbat, et al. Stochastic Gradient Push for Distributed Deep Learning, 2018, ICML.
[21] Wei Ren, et al. Event-triggered zero-gradient-sum distributed consensus optimization over directed networks, 2016, Automatica.
[22] Aryan Mokhtari, et al. Quantized Decentralized Consensus Optimization, 2018, IEEE Conference on Decision and Control (CDC).
[23] Qing Ling, et al. Asynchronous periodic event-triggered coordination of multi-agent systems, 2017, 56th IEEE Annual Conference on Decision and Control (CDC).
[24] Georgios B. Giannakis, et al. LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning, 2018, NeurIPS.
[25] N. Linial, et al. Expander Graphs and their Applications, 2006.
[26] Karl Henrik Johansson, et al. Distributed Optimization with Dynamic Event-Triggered Mechanisms, 2018, IEEE Conference on Decision and Control (CDC).
[27] Yiran Chen, et al. Learning Structured Sparsity in Deep Neural Networks, 2016, NIPS.
[28] Shenghuo Zhu, et al. Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning, 2018, AAAI.
[29] Wei Zhang, et al. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent, 2017, NIPS.
[30] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.
[31] Antoine Girard, et al. Dynamic Triggering Mechanisms for Event-Triggered Control, 2013, IEEE Transactions on Automatic Control.
[32] Kenneth Heafield, et al. Sparse Communication for Distributed Gradient Descent, 2017, EMNLP.
[33] Sebastian U. Stich. Local SGD Converges Fast and Communicates Little, 2018, ICLR.
[34] Martin Jaggi, et al. Decentralized Deep Learning with Arbitrary Communication Compression, 2019, ICLR.