Laplacian smoothing gradient descent

We propose a class of very simple modifications of gradient descent and stochastic gradient descent. When applied to a wide variety of machine learning problems, ranging from logistic regression to deep neural nets, the proposed surrogates can dramatically reduce the variance, allow larger step sizes, and improve generalization accuracy. The methods only involve multiplying the usual (stochastic) gradient by the inverse of a positive definite matrix (which can be computed efficiently by FFT) with a low condition number coming from a one-dimensional discrete Laplacian or its higher-order generalizations. The smoothing preserves the mean of the gradient while increasing its smallest component and decreasing its largest component. The theory of Hamilton-Jacobi partial differential equations shows that the implicit version of the new algorithm is almost equivalent to gradient descent on a new function which (i) has the same global minima as the original function and (ii) is ``more convex''. Moreover, we show that optimization algorithms with these surrogates converge uniformly in the discrete Sobolev $H_\sigma^p$ sense and reduce the optimality gap for convex optimization problems. The code is available at: \url{this https URL}
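To make the update concrete: with $A_\sigma = I - \sigma L$, where $L$ is the one-dimensional discrete Laplacian (taken here with periodic boundary conditions), the (stochastic) gradient $g$ is replaced by $A_\sigma^{-1} g$ before the usual descent step. Because $A_\sigma$ is then circulant, the solve reduces to one FFT, a pointwise division, and an inverse FFT. The sketch below illustrates this in NumPy under those assumptions; the function names `laplacian_smooth` and `lsgd_step` and the default value of $\sigma$ are illustrative choices, not taken from the released code.

```python
import numpy as np

def laplacian_smooth(grad, sigma=1.0):
    """Return (I - sigma * L)^{-1} grad, where L is the 1D discrete
    Laplacian with periodic boundary conditions.

    I - sigma * L is a symmetric circulant matrix, so its eigenvalues
    are the FFT of its first column and the solve costs O(n log n).
    Assumes grad is a 1D array with at least 3 entries.
    """
    n = grad.size
    # First column of I - sigma * L: [1 + 2*sigma, -sigma, 0, ..., 0, -sigma]
    v = np.zeros(n)
    v[0] = 1.0 + 2.0 * sigma
    v[1] = -sigma
    v[-1] = -sigma
    # Eigenvalues 1 + 2*sigma - 2*sigma*cos(2*pi*m/n) are >= 1, so the
    # matrix is positive definite; the m = 0 eigenvalue is exactly 1,
    # which is why the mean of the gradient is preserved.
    return np.real(np.fft.ifft(np.fft.fft(grad) / np.fft.fft(v)))

def lsgd_step(w, grad_fn, lr=0.1, sigma=1.0):
    """One Laplacian-smoothing gradient descent step:
    w <- w - lr * (I - sigma * L)^{-1} grad f(w)."""
    return w - lr * laplacian_smooth(grad_fn(w), sigma)
```

Setting $\sigma = 0$ recovers plain (stochastic) gradient descent, since $A_0 = I$.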
