Efficient Full-Matrix Adaptive Regularization
Yi Zhang | Karan Singh | Naman Agarwal | Elad Hazan | Cyril Zhang | Brian Bullins | Xinyi Chen
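For context on the title, here is a minimal sketch of the classical full-matrix AdaGrad preconditioner (Duchi et al., reference [21]) that work on efficient full-matrix adaptive regularization seeks to approximate cheaply. This is an illustrative toy implementation under my own assumptions, not the method proposed in the paper; all function and variable names are hypothetical.

import numpy as np

def full_matrix_adagrad_step(w, grad, G_accum, lr=0.1, eps=1e-8):
    # Illustrative sketch of full-matrix AdaGrad (not the paper's algorithm):
    # accumulate the sum of gradient outer products, then precondition the
    # gradient by the inverse square root of that matrix. This costs O(d^2)
    # memory and O(d^3) time per step, which is the bottleneck that
    # efficient full-matrix methods aim to avoid.
    G_accum = G_accum + np.outer(grad, grad)              # G_t = G_{t-1} + g_t g_t^T
    vals, vecs = np.linalg.eigh(G_accum)                  # G_accum is symmetric PSD
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 0.0) + eps)) @ vecs.T
    w = w - lr * inv_sqrt @ grad                          # w_{t+1} = w_t - eta * G_t^{-1/2} g_t
    return w, G_accum

# Toy usage on a least-squares objective f(w) = 0.5 * ||A w - b||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
w, G = np.zeros(5), np.zeros((5, 5))
for _ in range(200):
    g = A.T @ (A @ w - b)
    w, G = full_matrix_adagrad_step(w, g, G)
print(0.5 * np.linalg.norm(A @ w - b) ** 2)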
[1] Tengyu Ma, et al. Finding approximate local minima faster than gradient descent, 2016, STOC.
[2] Jimmy Ba, et al. Kronecker-factored Curvature Approximations for Recurrent Neural Networks, 2018, ICLR.
[3] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[4] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.
[5] Ann Bies, et al. The Penn Treebank: Annotating Predicate Argument Structure, 1994, HLT.
[6] Richard Socher, et al. An Analysis of Neural Language Modeling at Multiple Scales, 2018, ArXiv.
[7] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.
[8] Francesco Orabona, et al. On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, 2018, AISTATS.
[9] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.
[10] Elad Hazan, et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.
[11] Sanjiv Kumar, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[12] Mingyi Hong, et al. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.
[13] Jascha Sohl-Dickstein, et al. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Understanding and Improvement, 2017, ArXiv.
[14] Haipeng Luo, et al. Efficient Second Order Online Learning by Sketching, 2016, NIPS.
[15] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.
[16] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[17] Sanjiv Kumar, et al. Escaping Saddle Points with Adaptive Gradient Methods, 2019, ICML.
[18] Richard Socher, et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.
[19] Yoram Singer, et al. Shampoo: Preconditioned Stochastic Tensor Optimization, 2018, ICML.
[20] Andrea Montanari, et al. Convergence rates of sub-sampled Newton methods, 2015, NIPS.
[21] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[22] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Warm Restarts, 2016, ICLR.
[23] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[24] Razvan Pascanu, et al. On the difficulty of training recurrent neural networks, 2012, ICML.
[25] Saeed Ghadimi, et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming, 2013, SIAM J. Optim.
[26] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Restarts, 2016, ArXiv.
[27] Naman Agarwal, et al. Second-Order Stochastic Optimization for Machine Learning in Linear Time, 2016, J. Mach. Learn. Res.
[28] Enhong Chen, et al. Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions, 2018, ICLR.
[29] Richard Socher, et al. Improving Generalization Performance by Switching from Adam to SGD, 2017, ArXiv.
[30] Li Shen, et al. On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks, 2018, ArXiv.
[31] Yuan Cao, et al. On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, 2018, ArXiv.
[32] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.
[33] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.
[34] Yoshua Bengio, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations, 2016, ICLR.
[35] Xavier Gastaldi, et al. Shake-Shake regularization, 2017, ArXiv.
[36] Xiaoxia Wu, et al. AdaGrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization, 2019, ArXiv.
[37] Timothy Dozat, et al. Incorporating Nesterov Momentum into Adam, 2016.
[38] Jorge Nocedal, et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.
[39] Yi Zhang, et al. Stronger generalization bounds for deep nets via a compression approach, 2018, ICML.
[40] Hao Li, et al. Visualizing the Loss Landscape of Neural Nets, 2017, NeurIPS.
[41] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[42] Sébastien Bubeck, et al. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.
[43] Alistair P. Rendell, et al. CompAdaGrad: A Compressed, Complementary, Computationally-Efficient Adaptive Gradient Method, 2016, ArXiv.
[44] Yair Carmon, et al. "Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions, 2017, ICML.
[45] Zeyuan Allen-Zhu, et al. Katyusha: the first direct acceleration of stochastic gradient methods, 2016, J. Mach. Learn. Res.
[46] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.
[47] Chris Dyer, et al. On the State of the Art of Evaluation in Neural Language Models, 2017, ICLR.
[48] Joachim M. Buhmann, et al. Scalable Adaptive Stochastic Optimization Using Random Projections, 2016, NIPS.
[49] Zeyuan Allen-Zhu, et al. Variance Reduction for Faster Non-Convex Optimization, 2016, ICML.