Dario Amodei | Sam McCandlish | Jared Kaplan | OpenAI Dota Team
[1] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[2] Tom Minka,et al. TrueSkill™: A Bayesian Skill Rating System , 2006, NIPS.
[3] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009.
[4] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[5] Andrew Y. Ng,et al. Reading Digits in Natural Images with Unsupervised Feature Learning , 2011.
[6] Jorge Nocedal,et al. Sample size selection in optimization methods for machine learning , 2012, Math. Program..
[7] Geoffrey E. Hinton,et al. On the importance of initialization and momentum in deep learning , 2013, ICML.
[8] Tom Schaul,et al. No more pesky learning rates , 2012, ICML.
[9] Thorsten Brants,et al. One billion word benchmark for measuring progress in statistical language modeling , 2013, INTERSPEECH.
[10] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.
[11] Oriol Vinyals,et al. Qualitatively characterizing neural network optimization problems , 2014, ICLR.
[12] Quoc V. Le,et al. Adding Gradient Noise Improves Learning for Very Deep Networks , 2015, ArXiv.
[13] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[14] Marc G. Bellemare,et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract) , 2012, IJCAI.
[15] Rico Sennrich,et al. Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.
[16] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Pieter Abbeel,et al. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.
[18] Alex Graves,et al. Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.
[19] David W. Jacobs,et al. Big Batch SGD: Automated Inference using Adaptive Batch Sizes , 2016, ArXiv.
[20] Roger B. Grosse,et al. Distributed Second-Order Optimization using Kronecker-Factored Approximations , 2016, ICLR.
[21] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.
[22] Yang You,et al. Large Batch Training of Convolutional Networks , 2017, ArXiv.
[23] Naftali Tishby,et al. Opening the Black Box of Deep Neural Networks via Information , 2017, ArXiv.
[24] Javier Romero,et al. Coupling Adaptive Batch Sizes with Learning Rates , 2016, UAI.
[25] Gabriel Goh,et al. Why Momentum Really Works , 2017, Distill.
[26] Takuya Akiba,et al. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes , 2017, ArXiv.
[27] Elad Hoffer,et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks , 2017, NIPS.
[28] Jorge Nocedal,et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.
[29] Yang Yang,et al. Deep Learning Scaling is Predictable, Empirically , 2017, ArXiv.
[30] Alec Radford,et al. Proximal Policy Optimization Algorithms , 2017, ArXiv.
[31] David M. Blei,et al. Stochastic Gradient Descent as Approximate Bayesian Inference , 2017, J. Mach. Learn. Res..
[32] Michael Garland,et al. AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks , 2017, ArXiv.
[33] David Budden,et al. Distributed Prioritized Experience Replay , 2018, ICLR.
[34] Joseph Gonzalez,et al. On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent , 2018, ArXiv.
[35] Henryk Michalewski,et al. Distributed Deep Reinforcement Learning: Learn how to play Atari games in 21 minutes , 2018, ISC.
[36] Ethan Dyer,et al. Gradient Descent Happens in a Tiny Subspace , 2018, ArXiv.
[37] Pieter Abbeel,et al. Accelerated Methods for Deep Reinforcement Learning , 2018, ArXiv.
[38] Dimitris S. Papailiopoulos,et al. Gradient Diversity: a Key Ingredient for Scalable Distributed Learning , 2017, AISTATS.
[39] James Demmel,et al. ImageNet Training in Minutes , 2017, ICPP.
[40] Quoc V. Le,et al. Don't Decay the Learning Rate, Increase the Batch Size , 2017, ICLR.
[41] Dimitris S. Papailiopoulos,et al. The Effect of Network Width on the Performance of Large-batch Training , 2018, NeurIPS.
[42] Yuanzhou Yang,et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes , 2018, ArXiv.
[43] Quoc V. Le,et al. A Bayesian Perspective on Generalization and Stochastic Gradient Descent , 2017, ICLR.
[44] Larry Rudolph,et al. Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? , 2018, ArXiv.
[45] Myle Ott,et al. Scaling Neural Machine Translation , 2018, WMT.
[46] Bryan Catanzaro,et al. Large Scale Language Modeling: Converging on 40GB of Text in Four Hours , 2018, 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD).
[47] Jorge Nocedal,et al. Optimization Methods for Large-Scale Machine Learning , 2016, SIAM Rev..
[48] Jason Yosinski,et al. Measuring the Intrinsic Dimension of Objective Landscapes , 2018, ICLR.
[49] Raef Bassily,et al. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning , 2017, ICML.
[50] Renjie Liao,et al. Understanding Short-Horizon Bias in Stochastic Meta-Optimization , 2018, ICLR.
[51] Jeff Donahue,et al. Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.
[52] Jascha Sohl-Dickstein,et al. Measuring the Effects of Data Parallelism on Neural Network Training , 2018, J. Mach. Learn. Res..