The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning