Hao Zhou | Xunpeng Huang | Zhe Wang | Runxin Xu | Zhengyang Liu | Lei Li