[1] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.
[2] H. Robbins. A Stochastic Approximation Method, 1951.
[3] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.
[4] Yee Whye Teh, et al. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011, ICML.
[5] Lutz Prechelt, et al. Early Stopping - But When?, 2012, Neural Networks: Tricks of the Trade.
[6] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[7] Mark W. Schmidt, et al. Hybrid Deterministic-Stochastic Methods for Data Fitting, 2011, SIAM J. Sci. Comput.
[8] Jorge Nocedal, et al. Sample size selection in optimization methods for machine learning, 2012, Math. Program.
[9] Alex Krizhevsky, et al. One weird trick for parallelizing convolutional neural networks, 2014, ArXiv.
[10] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[11] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.
[12] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Restarts, 2016, ArXiv.
[13] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.
[14] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[15] Yang You, et al. Large Batch Training of Convolutional Networks, 2017, ArXiv.
[16] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[17] Javier Romero, et al. Coupling Adaptive Batch Sizes with Learning Rates, 2016, UAI.
[18] David A. Patterson, et al. In-datacenter performance analysis of a tensor processing unit, 2017, ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[19] E Weinan, et al. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms, 2015, ICML.
[20] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.
[21] Takuya Akiba, et al. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes, 2017, ArXiv.
[22] Elad Hoffer, et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.
[23] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[24] David W. Jacobs, et al. Automated Inference with Adaptive Batches, 2017, AISTATS.
[25] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Warm Restarts, 2016, ICLR.
[26] Sergey Ioffe, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016, AAAI.
[27] Stefano Soatto, et al. Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.
[28] J. Demmel, et al. ImageNet Training in 24 Minutes, 2017.
[29] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, ArXiv.
[30] David M. Blei, et al. Stochastic Gradient Descent as Approximate Bayesian Inference, 2017, J. Mach. Learn. Res.
[31] James Demmel, et al. ImageNet Training in Minutes, 2017, ICPP.
[32] Quoc V. Le, et al. A Bayesian Perspective on Generalization and Stochastic Gradient Descent, 2017, ICLR.
[33] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Rev.