An overview of gradient descent optimization algorithms

Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, because practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions about the behaviour of the different algorithms so that she can put them to use. In the course of this overview, we look at the different variants of gradient descent, summarize their challenges, introduce the most common optimization algorithms, review architectures for training in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
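
As a minimal, self-contained illustration of the update rules the overview discusses, the sketch below contrasts the three basic variants of gradient descent (batch, stochastic, and mini-batch), which differ only in how much data is used to compute each parameter update. The function names, the NumPy-based interface, and the toy least-squares usage are assumptions made for this example, not code from the article.

    import numpy as np

    def batch_gradient_descent(theta, grad_fn, data, lr=0.01, epochs=100):
        # Vanilla (batch) gradient descent: one update per pass over the full dataset.
        for _ in range(epochs):
            theta = theta - lr * grad_fn(theta, data)
        return theta

    def stochastic_gradient_descent(theta, grad_fn, data, lr=0.01, epochs=100):
        # SGD: one noisy update per training example, after shuffling each epoch.
        for _ in range(epochs):
            np.random.shuffle(data)
            for example in data:
                theta = theta - lr * grad_fn(theta, example[np.newaxis, :])
        return theta

    def minibatch_gradient_descent(theta, grad_fn, data, lr=0.01,
                                   epochs=100, batch_size=32):
        # Mini-batch gradient descent: one update per small batch of examples.
        for _ in range(epochs):
            np.random.shuffle(data)
            for i in range(0, len(data), batch_size):
                theta = theta - lr * grad_fn(theta, data[i:i + batch_size])
        return theta

    # Toy usage (assumed example): linear regression with a mean-squared-error gradient.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=256)
    data = np.hstack([X, y[:, None]])

    def mse_grad(theta, batch):
        Xb, yb = batch[:, :-1], batch[:, -1]
        return (2.0 / len(batch)) * Xb.T @ (Xb @ theta - yb)

    w = minibatch_gradient_descent(np.zeros(3), mse_grad, data, lr=0.1, epochs=50)

All three variants converge to roughly the same solution on this toy problem; they differ in the cost and noise of each individual update, which is the trade-off that the more advanced optimizers surveyed in the article try to improve on.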
