[1] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[2] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.
[3] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.
[4] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Masafumi Yamazaki, et al. Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds, 2019, arXiv.
[6] Eric P. Xing, et al. GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server, 2016, EuroSys.
[7] Parijat Dube, et al. Slow and Stale Gradients Can Win the Race, 2018, IEEE Journal on Selected Areas in Information Theory.
[8] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[9] Bruno Sericola, et al. Distributed deep learning on edge-devices: Feasibility via adaptive compression, 2017, IEEE 16th International Symposium on Network Computing and Applications (NCA).
[10] Ji Liu, et al. Staleness-Aware Async-SGD for Distributed Deep Learning, 2015, IJCAI.
[11] Nenghai Yu, et al. Asynchronous Stochastic Gradient Descent with Delay Compensation, 2016, ICML.
[12] Geoffrey E. Hinton. Learning multiple layers of representation, 2007, Trends in Cognitive Sciences.
[13] Boris Polyak. Some methods of speeding up the convergence of iteration methods, 1964.
[14] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.
[15] William Chan, et al. Distributed asynchronous optimization of convolutional neural networks, 2014, INTERSPEECH.
[16] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.
[17] Yann LeCun, et al. Deep learning with Elastic Averaging SGD, 2014, NIPS.
[18] Jascha Sohl-Dickstein, et al. Measuring the Effects of Data Parallelism on Neural Network Training, 2018, Journal of Machine Learning Research.
[19] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[20] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[21] Jiawei Jiang, et al. Heterogeneity-aware Distributed Parameter Servers, 2017, SIGMOD Conference.
[22] Assaf Schuster, et al. Taming Momentum in a Distributed Asynchronous Environment, 2019, arXiv.
[23] Ioannis Mitliagkas, et al. Asynchrony begets momentum, with an application to deep learning, 2016, 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).
[24] Howard Jay Siegel, et al. Task execution time modeling for heterogeneous computing systems, 2000, Proceedings 9th Heterogeneous Computing Workshop (HCW 2000).
[25] Kenneth Heafield, et al. Making Asynchronous Stochastic Gradient Descent Work for Transformers, 2019, NGT@EMNLP-IJCNLP.
[26] Samy Bengio, et al. Revisiting Distributed Synchronous SGD, 2016, arXiv.
[27] Tao Wang, et al. Image Classification at Supercomputer Scale, 2018, arXiv.
[28] Yijun Huang, et al. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization, 2015, NIPS.