On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length
Stanislaw Jastrzebski | Zachary Kenton | Nicolas Ballas | Asja Fischer | Yoshua Bengio | Amos Storkey
[1] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, 1950.
[2] Shun-ichi Amari et al. Network information criterion - determining the number of hidden units for an artificial neural network model, 1994, IEEE Trans. Neural Networks.
[3] Jürgen Schmidhuber et al. Flat Minima, 1997, Neural Computation.
[4] Christopher Potts et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.
[5] Klaus-Robert Müller et al. Efficient BackProp, 2012, Neural Networks: Tricks of the Trade.
[6] Surya Ganguli et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, 2014, NIPS.
[7] Wojciech Zaremba et al. Recurrent Neural Network Regularization, 2014, ArXiv.
[8] Oriol Vinyals et al. Qualitatively characterizing neural network optimization problems, 2014, ICLR.
[9] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[10] Yann LeCun et al. Singularity of the Hessian in Deep Learning, 2016, ArXiv.
[11] Kaiming He et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[12] Nathan Srebro et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[13] Samy Bengio et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.
[14] Razvan Pascanu et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.
[15] Jorge Nocedal et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[16] Masato Taki et al. Deep Residual Networks and Weight Initialization, 2017, ArXiv.
[17] Nathan Srebro et al. Exploring Generalization in Deep Learning, 2017, NIPS.
[18] Roland Vollgraf et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017, ArXiv.
[19] Yoshua Bengio et al. Three Factors Influencing Minima in SGD, 2017, ArXiv.
[20] Lorenzo Rosasco et al. Theory of Deep Learning III: explaining the non-overfitting puzzle, 2017, ArXiv.
[21] Yann Dauphin et al. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, 2017, ICLR.
[22] Huan Wang et al. Identifying Generalization Properties in Neural Networks, 2018, ArXiv.
[23] Carla P. Gomes et al. Understanding Batch Normalization, 2018, NeurIPS.
[24] Yoshua Bengio et al. A Walk with SGD, 2018, ArXiv.
[25] Stefano Soatto et al. Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks, 2017, 2018 Information Theory and Applications Workshop (ITA).
[26] Lei Wu et al. The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent, 2018, ArXiv.
[27] Chunpeng Wu et al. SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning, 2018, ArXiv:1805.07898.
[28] Kurt Keutzer et al. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries, 2018, NeurIPS.
[29] Andrew M. Saxe et al. High-dimensional dynamics of generalization error in neural networks, 2017, Neural Networks.