How Does Learning Rate Decay Help Modern Neural Networks?
Kaichao You | Mingsheng Long | Jianmin Wang | Michael I. Jordan
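As context for the title, a minimal sketch (not taken from the paper) of the stagewise learning rate decay commonly used to train the ResNet/DenseNet models cited below; it assumes PyTorch's torch.optim.lr_scheduler.MultiStepLR, and the milestone epochs and decay factor 0.1 are illustrative choices only:

    import torch

    # Toy model and SGD optimizer; the architecture here is just a placeholder.
    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Divide the learning rate by 10 at epochs 30 and 60 (illustrative milestones).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

    for epoch in range(90):
        # ... per-batch forward/backward passes and optimizer.step() would go here ...
        scheduler.step()  # apply the step decay once per epoch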
[1] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[2] Kilian Q. Weinberger, et al. Snapshot Ensembles: Train 1, get M for free, 2017, ICLR.
[3] Klaus-Robert Müller, et al. Efficient BackProp, 2012, Neural Networks: Tricks of the Trade.
[4] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, CVPR 2017.
[5] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[6] Quoc V. Le, et al. Do Better ImageNet Models Transfer Better?, 2018, CVPR 2019.
[7] Carla P. Gomes, et al. Understanding Batch Normalization, 2018, NeurIPS.
[8] Ioannis Mitliagkas, et al. Accelerated Stochastic Power Iteration, 2017, AISTATS.
[9] Colin Wei, et al. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks, 2019, NeurIPS.
[10] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.
[11] Antonio Torralba, et al. Recognizing indoor scenes, 2009, CVPR.
[12] Pietro Perona, et al. The Caltech-UCSD Birds-200-2011 Dataset, 2011.
[13] Bernhard Schölkopf, et al. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations, 2018, ICML.
[14] Thomas Hofmann, et al. Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization, 2018, AISTATS.
[15] Tengyu Ma, et al. Matrix Completion has No Spurious Local Minimum, 2016, NIPS.
[16] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[17] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR 2016.
[18] Yann LeCun, et al. Second Order Properties of Error Surfaces: Learning Time and Generalization, 1990, NIPS.
[19] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, arXiv.
[20] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[21] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[22] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[23] Mingjie Sun, et al. Rethinking the Value of Network Pruning, 2018, ICLR.
[24] Kurt Keutzer, et al. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries, 2018, NeurIPS.
[25] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.
[26] Yoshua Bengio, et al. How transferable are features in deep neural networks?, 2014, NIPS.
[27] Fred Zhang, et al. SGD on Neural Networks Learns Functions of Increasing Complexity, 2019, NeurIPS.
[28] Matus Telgarsky, et al. The implicit bias of gradient descent on nonseparable data, 2019, COLT.
[29] Yoshua Bengio, et al. Practical Recommendations for Gradient-Based Training of Deep Architectures, 2012, Neural Networks: Tricks of the Trade.
[30] Aleksander Madry, et al. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift), 2018, NeurIPS.
[31] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.
[32] Marc Alexa, et al. How do humans sketch objects?, 2012, ACM Trans. Graph.
[33] G. Griffin, et al. Caltech-256 Object Category Dataset, 2007.
[34] Jon Kleinberg, et al. Transfusion: Understanding Transfer Learning for Medical Imaging, 2019, NeurIPS.
[35] Razvan Pascanu, et al. On the difficulty of training recurrent neural networks, 2012, ICML.
[36] Leslie N. Smith, et al. Cyclical Learning Rates for Training Neural Networks, 2015, WACV 2017.
[37] Sanjiv Kumar, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[38] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.
[39] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.
[40] Yuanzhi Li, et al. An Alternative View: When Does SGD Escape Local Minima?, 2018, ICML.
[41] Xu Sun, et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, 2019, ICLR.
[42] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.
[43] Ivan Laptev, et al. Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, 2014, CVPR.
[44] Mikhail Belkin, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.
[45] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Warm Restarts, 2016, ICLR.
[46] Demis Hassabis, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, 2018, Science.