Efficient-Adam: Communication-Efficient Distributed Adam with Complexity Analysis

Distributed adaptive stochastic gradient methods have been widely used for large-scale nonconvex optimization, such as training deep learning models. However, their communication complexity for finding an ε-stationary point has rarely been analyzed in the nonconvex setting. In this work, we present a novel communication-efficient distributed Adam in the parameter-server model for stochastic nonconvex optimization, dubbed Efficient-Adam. Specifically, we incorporate a two-way quantization scheme into Efficient-Adam to reduce the communication cost between the workers and the server. Simultaneously, we adopt a two-way error-feedback strategy to reduce the biases caused by the two-way quantization on the server and the workers, respectively. In addition, we establish the iteration complexity of the proposed Efficient-Adam for a class of quantization operators, and further characterize its communication complexity between the server and workers when an ε-stationary point is reached. Finally, we apply Efficient-Adam to solve a toy stochastic convex optimization problem and to train deep learning models on real-world vision and language tasks. Extensive experiments, together with the theoretical guarantees, justify the merits of Efficient-Adam.
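
The abstract describes compressing messages in both directions (workers to server and server to workers) and correcting the resulting bias with error feedback on each side. The sketch below is a minimal illustration of that compress-with-error-feedback pattern, assuming a simple top-k sparsifier as a stand-in for the quantization operator; the function names and the NumPy-based toy setup are illustrative assumptions of ours, not the paper's implementation, which additionally couples these steps with Adam-style adaptive updates.

```python
# Minimal sketch (not the authors' code) of two-way compression with error
# feedback. `topk_compress`, `worker_step`, and `server_step` are hypothetical
# names used only for illustration.
import numpy as np

def topk_compress(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def worker_step(grad, error, k):
    """Worker-to-server direction: compress (gradient + accumulated error)."""
    corrected = grad + error            # add the residual error kept locally
    msg = topk_compress(corrected, k)   # quantize/compress the message
    new_error = corrected - msg         # feed the compression bias back
    return msg, new_error

def server_step(worker_msgs, error, k):
    """Server-to-worker direction: aggregate, then compress with feedback again."""
    aggregated = np.mean(worker_msgs, axis=0) + error
    msg = topk_compress(aggregated, k)
    new_error = aggregated - msg
    return msg, new_error

# Toy usage: two workers, a 10-dimensional parameter, keep 3 coordinates.
rng = np.random.default_rng(0)
worker_err = [np.zeros(10), np.zeros(10)]
server_err = np.zeros(10)
grads = [rng.normal(size=10), rng.normal(size=10)]
msgs = []
for i in range(2):
    m, worker_err[i] = worker_step(grads[i], worker_err[i], k=3)
    msgs.append(m)
update, server_err = server_step(msgs, server_err, k=3)
print(update)  # the compressed update broadcast back to the workers
```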
