The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size.