Guangwen Yang | Zhiqiang Liu | Wei Xue | Haohuan Fu | Hao Jing | Wenlai Zhao | Liang Qiao | Yushu Chen