Decentralized SGD with Asynchronous, Local and Quantized Updates.
Dan Alistarh | Peter Davies | Giorgi Nadiradze | Shigang Li | Amirmojtaba Sabour | Ilia Markov