Yoshua Bengio | Devansh Arpit | Chen Xing | Christos Tsirigotis
[1] Boris Polyak. Some methods of speeding up the convergence of iteration methods, 1964.
[2] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.
[3] Sompolinsky et al. Spin-glass models of neural networks, 1985, Physical Review A, General Physics.
[4] V. Dotsenko. An Introduction to the Theory of Spin Glasses and Neural Networks, 1995.
[5] Shun-ichi Amari et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.
[6] H. Kushner et al. Stochastic Approximation and Recursive Algorithms and Applications, 2003.
[7] Alex Krizhevsky et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[8] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, arXiv.
[9] Klaus-Robert Müller et al. Efficient BackProp, 2012, Neural Networks: Tricks of the Trade.
[10] Geoffrey E. Hinton et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[11] Surya Ganguli et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, 2014, NIPS.
[12] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.
[13] Oriol Vinyals et al. Qualitatively characterizing neural network optimization problems, 2014, ICLR.
[14] Quoc V. Le et al. Adding Gradient Noise Improves Learning for Very Deep Networks, 2015, arXiv.
[15] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[16] Yann LeCun et al. The Loss Surfaces of Multilayer Networks, 2014, AISTATS.
[17] Roger B. Grosse et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.
[18] Michael S. Bernstein et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[19] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[20] Yoram Singer et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.
[21] Jian Sun et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Ohad Shamir et al. On the Quality of the Initial Basin in Overspecified Neural Networks, 2015, ICML.
[23] Kaiming He et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.
[24] Joan Bruna et al. Topology and Geometry of Half-Rectified Network Optimization, 2016, ICLR.
[25] Zhanxing Zhu et al. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes, 2017, arXiv.
[26] E Weinan et al. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms, 2015, ICML.
[27] Leslie N. Smith et al. Cyclical Learning Rates for Training Neural Networks, 2015, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).
[28] Leslie Pack Kaelbling et al. Generalization in Deep Learning, 2017, arXiv.
[29] Samy Bengio et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.
[30] Razvan Pascanu et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.
[31] Yoshua Bengio et al. A Closer Look at Memorization in Deep Networks, 2017, ICML.
[32] Elad Hoffer et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.
[33] Jorge Nocedal et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[34] Stefano Soatto et al. Emergence of invariance and disentangling in deep representations, 2017.
[35] Quoc V. Le et al. Understanding Generalization and Stochastic Gradient Descent, 2017.
[36] Nathan Srebro et al. Exploring Generalization in Deep Learning, 2017, NIPS.
[37] David M. Blei et al. Stochastic Gradient Descent as Approximate Bayesian Inference, 2017, J. Mach. Learn. Res.
[38] Jian-Guo Liu et al. Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent, 2017, arXiv.
[39] Yoshua Bengio et al. Three Factors Influencing Minima in SGD, 2017, arXiv.
[40] Yann Dauphin et al. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, 2017, ICLR.
[41] Stefano Soatto et al. Emergence of Invariance and Disentanglement in Deep Representations, 2017, 2018 Information Theory and Applications Workshop (ITA).
[42] Quoc V. Le et al. Don't Decay the Learning Rate, Increase the Batch Size, 2017, ICLR.
[43] Stefano Soatto et al. Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks, 2017, 2018 Information Theory and Applications Workshop (ITA).
[44] Lei Wu et al. The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent, 2018, arXiv.
[45] Quoc V. Le et al. A Bayesian Perspective on Generalization and Stochastic Gradient Descent, 2017, ICLR.
[46] Wenqing Hu et al. On the diffusion approximation of nonconvex stochastic gradient descent, 2017, Annals of Mathematical Sciences and Applications.
[47] Nicholay Topin et al. Super-convergence: very fast training of neural networks using large learning rates, 2018, Defense + Commercial Sensing.
[48] Andrew M. Saxe et al. High-dimensional dynamics of generalization error in neural networks, 2017, Neural Networks.