Wojciech Zaremba | Jie Tang | Diogo Almeida | Clemens Winter
[1] Mark W. Schmidt, et al. Stop Wasting My Gradients: Practical SVRG, 2015, NIPS.
[2] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, ArXiv.
[3] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.
[4] Roland Vollgraf, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017, ArXiv.
[5] Andrew Y. Ng, et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping, 1999, ICML.
[6] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.
[7] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.
[8] Yoram Singer, et al. Shampoo: Preconditioned Stochastic Tensor Optimization, 2018, ICML.
[9] Ryan P. Adams, et al. Gradient-based Hyperparameter Optimization through Reversible Learning, 2015, ICML.
[10] Stephen Merity, et al. Single Headed Attention RNN: Stop Thinking With Your Head, 2019, ArXiv.
[11] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.
[12] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[13] Yoshua Bengio, et al. Practical Recommendations for Gradient-Based Training of Deep Architectures, 2012, Neural Networks: Tricks of the Trade.
[14] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[15] Chong Wang, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015, ICML.
[16] Marcin Andrychowicz, et al. Learning to learn by gradient descent by gradient descent, 2016, NIPS.
[17] Dario Amodei, et al. An Empirical Model of Large-Batch Training, 2018, ArXiv.
[18] Tat-Seng Chua, et al. Neural Collaborative Filtering, 2017, WWW.
[19] Misha Denil, et al. Learned Optimizers that Scale and Generalize, 2017, ICML.
[20] Leslie N. Smith, et al. Cyclical Learning Rates for Training Neural Networks, 2015, WACV.
[21] Thomas Wolf, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.
[22] Aleksander Madry, et al. The Two Regimes of Deep Network Training, 2020, ArXiv.
[23] Jaehoon Lee, et al. On Empirical Comparisons of Optimizers for Deep Learning, 2019, ArXiv.
[24] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[25] Renjie Liao, et al. Understanding Short-Horizon Bias in Stochastic Meta-Optimization, 2018, ICLR.
[26] Gang Wang, et al. Reinforcement Learning for Learning Rate Control, 2017, ArXiv.
[27] Gerald Tesauro, et al. Temporal Difference Learning and TD-Gammon, 1995, J. Int. Comput. Games Assoc.
[28] Yoram Singer, et al. Second Order Optimization Made Practical, 2020, ArXiv.
[29] Guodong Zhang, et al. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model, 2019, NeurIPS.
[30] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.
[31] Marcin Andrychowicz, et al. Asymmetric Actor Critic for Image-Based Robot Learning, 2017, Robotics: Science and Systems.
[32] Yang You, et al. Large Batch Training of Convolutional Networks, 2017, ArXiv.
[33] Mark W. Schmidt, et al. Online Learning Rate Adaptation with Hypergradient Descent, 2017, ICLR.
[34] Jeremy Nixon, et al. Understanding and correcting pathologies in the training of learned optimizers, 2018, ICML.
[35] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[36] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[37] Massimiliano Pontil, et al. On the Iteration Complexity of Hypergradient Computation, 2020, ICML.
[38] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR.
[39] Jürgen Schmidhuber, et al. Learning to Forget: Continual Prediction with LSTM, 2000, Neural Computation.
[40] Jason Yosinski, et al. First-Order Preconditioning via Hypergradient Descent, 2019, ArXiv.
[41] Sebastian Nowozin, et al. Learning Step Size Controllers for Robust Neural Network Training, 2016, AAAI.
[42] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[43] Jascha Sohl-Dickstein, et al. Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves, 2020, ArXiv.