SPARQ-SGD: Event-Triggered and Compressed Communication in Decentralized Stochastic Optimization

In this paper, we propose and analyze SPARQ-SGD, an event-triggered and compressed algorithm for decentralized training of large-scale machine learning models. Each node can locally evaluate a condition (event) that triggers a communication in which quantized and sparsified local model parameters are sent. In SPARQ-SGD, each node takes at least a fixed number ($H$) of local gradient steps and then checks whether its model parameters have changed significantly since its last update; it communicates compressed model parameters only when this change exceeds a (design) criterion. We prove that SPARQ-SGD converges as $O(\frac{1}{nT})$ and $O(\frac{1}{\sqrt{nT}})$ in the strongly convex and non-convex settings, respectively, demonstrating that such aggressive compression, including event-triggered communication, model sparsification, and quantization, does not affect the overall convergence rate compared to uncompressed decentralized training, thereby yielding communication efficiency for "free". We evaluate SPARQ-SGD on real datasets and demonstrate significant savings in communication over the state of the art.
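To make the per-node update loop concrete, the following is a minimal NumPy sketch of what a single node might do: run $H$ local SGD steps, check the event trigger, and only then broadcast a compressed (top-$k$ sparsified and sign-quantized) update to its neighbors. The `compress` operator, the `neighbors` send/receive interface, the threshold schedule `c0 * lr**2`, and the mixing step are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def quantize_sign(v):
    """Scaled sign quantization: transmit sign(v) with an l1/d scale."""
    scale = np.linalg.norm(v, 1) / max(v.size, 1)
    return scale * np.sign(v)

def compress(v, k):
    """Composed compression: sparsify first, then quantize the survivors."""
    return quantize_sign(top_k(v, k))

def sparq_sgd_node(x0, grad_fn, neighbors, W_row, T, H, k, lr, c0):
    """
    One node's training loop (illustrative sketch).
      x0        -- initial model parameters (np.ndarray)
      grad_fn   -- grad_fn(x) returns a stochastic gradient at x
      neighbors -- hypothetical object with send(msg)/recv() for gossip
      W_row     -- this node's row of the doubly stochastic mixing matrix
      T, H      -- total iterations and local steps between trigger checks
      k, lr, c0 -- sparsification level, step size, trigger-threshold constant
    """
    x = x0.copy()
    x_hat = x0.copy()              # last value shared with neighbors
    for t in range(T):
        x -= lr * grad_fn(x)       # local SGD step
        if (t + 1) % H != 0:
            continue               # only check the trigger every H steps
        # Event trigger: communicate only if the local model drifted enough.
        threshold = c0 * lr**2     # decaying threshold (one possible choice)
        if np.linalg.norm(x - x_hat)**2 > threshold:
            delta = compress(x - x_hat, k)
            x_hat = x_hat + delta
            neighbors.send(delta)  # broadcast the compressed update
        # Gossip step: mix with neighbors' latest shared copies.
        copies = neighbors.recv()  # dict: neighbor id -> its x_hat
        x = x + sum(W_row[j] * (copies[j] - x_hat) for j in copies)
    return x
```

The point of the trigger is that between events a node pays no communication at all, and when it does communicate it sends only a compressed difference, which is how the scheme stacks local computation, event-triggering, sparsification, and quantization in a single loop.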
