暂无分享,去创建一个
[1] Yi Yang,et al. A Unified Analysis of Stochastic Momentum Methods for Deep Learning , 2018, IJCAI.
[2] Cho-Jui Hsieh,et al. A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order , 2016, NIPS.
[3] Stephen J. Wright,et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.
[4] Michael G. Rabbat,et al. Stochastic Gradient Push for Distributed Deep Learning , 2018, ICML.
[5] Jorge Nocedal,et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.
[6] Samy Bengio,et al. Revisiting Distributed Synchronous SGD , 2016, ArXiv.
[7] Yijun Huang,et al. Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization , 2015, NIPS.
[8] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.
[9] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.
[10] Tuo Zhao,et al. Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization , 2018, NeurIPS.
[11] Klaus-Robert Müller,et al. Efficient BackProp , 2012, Neural Networks: Tricks of the Trade.
[12] S. Shankar Sastry,et al. Step Size Matters in Deep Learning , 2018, NeurIPS.
[13] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Wei Zhang,et al. Asynchronous Decentralized Parallel Stochastic Gradient Descent , 2017, ICML.
[15] Parijat Dube,et al. Slow and Stale Gradients Can Win the Race , 2018, IEEE Journal on Selected Areas in Information Theory.
[16] Lei Wu,et al. How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective , 2018, NeurIPS.
[17] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[18] Elad Hoffer,et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks , 2017, NIPS.
[19] Ji Liu,et al. Staleness-Aware Async-SGD for Distributed Deep Learning , 2015, IJCAI.
[20] Ohad Shamir,et al. A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates , 2018, ALT.
[21] Yi Zhou,et al. Toward Understanding the Impact of Staleness in Distributed Machine Learning , 2018, ICLR.
[22] Ioannis Mitliagkas,et al. Asynchrony begets momentum, with an application to deep learning , 2016, 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).
[23] Jascha Sohl-Dickstein,et al. Measuring the Effects of Data Parallelism on Neural Network Training , 2018, J. Mach. Learn. Res..
[24] Alexander J. Smola,et al. Scaling Distributed Machine Learning with the Parameter Server , 2014, OSDI.
[25] A. W. M. van den Enden,et al. Discrete Time Signal Processing , 1989 .
[26] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .
[27] Marc'Aurelio Ranzato,et al. Large Scale Distributed Deep Networks , 2012, NIPS.
[28] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.
[29] Assaf Schuster,et al. Taming Momentum in a Distributed Asynchronous Environment , 2019, ArXiv.
[30] Nikos Komodakis,et al. Wide Residual Networks , 2016, BMVC.