Toward Communication Efficient Adaptive Gradient Method
Xiangyi Chen | Xiaoyun Li | Ping Li
[1] Alexander J. Smola, et al. Scaling Distributed Machine Learning with the Parameter Server, 2014, OSDI.
[2] Sebastian U. Stich, et al. Local SGD Converges Fast and Communicates Little, 2018, ICLR.
[3] Sanjiv Kumar, et al. Escaping Saddle Points with Adaptive Gradient Methods, 2019, ICML.
[4] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.
[5] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[6] Peter Richtárik, et al. Federated Learning: Strategies for Improving Communication Efficiency, 2016, ArXiv.
[7] David J. Slate, et al. Letter Recognition Using Holland-Style Adaptive Classifiers, 1991, Machine Learning.
[8] Xu Sun, et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, 2019, ICLR.
[9] Alexander J. Smola, et al. Parallelized Stochastic Gradient Descent, 2010, NIPS.
[10] Ping Li, et al. MOBIUS: Towards the Next Generation of Query-Ad Matching in Baidu's Sponsored Search, 2019, KDD.
[11] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[12] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.
[13] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.
[14] Indranil Gupta, et al. Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates, 2019, ArXiv.
[15] Xiaoxia Wu, et al. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization, 2018, ICML.
[16] Ping Li, et al. Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems, 2020, MLSys.
[17] Li Shen, et al. On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks, 2018, ArXiv.
[18] Yuan Cao, et al. On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, 2018, ArXiv.
[19] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, ArXiv.
[20] Rong Jin, et al. On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization, 2019, ICML.
[21] Mingyi Hong, et al. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.
[22] Aryan Mokhtari, et al. FedPAQ: A Communication-Efficient Federated Learning Method with Periodic Averaging and Quantization, 2019, AISTATS.
[23] Sanjiv Kumar, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[24] Ji Liu, et al. Gradient Sparsification for Communication-Efficient Distributed Optimization, 2017, NeurIPS.
[25] Yann Dauphin, et al. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, 2017, ICLR.
[26] Yi Zhang, et al. Efficient Full-Matrix Adaptive Regularization, 2020, ICML.
[27] Enhong Chen, et al. Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions, 2018, ICLR.
[28] Francesco Orabona, et al. On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, 2018, AISTATS.
[29] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[30] Xu Li, et al. Improved Touch-screen Inputting Using Sequence-level Prediction Generation, 2020, WWW.
[31] Kamyar Azizzadenesheli, et al. signSGD: compressed optimisation for non-convex problems, 2018, ICML.
[32] Li Shen, et al. A Sufficient Condition for Convergences of Adam and RMSProp, 2019, CVPR.
[33] Farzin Haddadpour, et al. Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization, 2019, ICML.
[34] Jie Chen, et al. Asynchronous parallel adaptive stochastic gradient methods, 2020, ArXiv.
[35] Xiaoyun Li, et al. FedSKETCH: Communication-Efficient and Private Federated Learning via Sketching, 2020, ArXiv.
[36] Richard Socher, et al. Improving Generalization Performance by Switching from Adam to SGD, 2017, ArXiv.
[37] Manzil Zaheer, et al. Adaptive Federated Optimization, 2020, ICLR.
[38] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.
[39] Andreas Krause, et al. Advances in Neural Information Processing Systems (NIPS), 2014.
[40] Blaise Agüera y Arcas, et al. Communication-Efficient Learning of Deep Networks from Decentralized Data, 2016, AISTATS.
[41] Fan Zhou, et al. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization, 2017, IJCAI.
[42] Sanjiv Kumar, et al. Adaptive Methods for Nonconvex Optimization, 2018, NeurIPS.