A Comparative Analysis of Gradient Descent-Based Optimization Algorithms on Convolutional Neural Networks

In this paper, we present a comparative evaluation of seven of the most commonly used first-order stochastic gradient-based optimization techniques in a simple Convolutional Neural Network (ConvNet) architecture. The investigated techniques are Stochastic Gradient Descent (SGD) in its vanilla form (vSGD), with momentum (SGDm), and with momentum and Nesterov acceleration (SGDm+n); Root Mean Square Propagation (RMSProp); Adaptive Moment Estimation (Adam); Adaptive Gradient (AdaGrad); Adaptive Delta (AdaDelta); the infinity-norm extension of adaptive moment estimation (AdaMax); and Nesterov-accelerated Adaptive Moment Estimation (Nadam). We trained the model and evaluated the optimization techniques in terms of convergence speed, accuracy, and loss using three randomly selected, publicly available image classification datasets. The overall experimental results show that Nadam achieved the best performance across the three datasets, while AdaDelta performed the worst.
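To make the comparison concrete, the following is a minimal sketch of how such an evaluation can be set up with tf.keras. It is not the authors' exact setup: the ConvNet architecture, hyperparameters, number of epochs, and the use of Fashion-MNIST as the example dataset are illustrative assumptions only.

# Minimal sketch (illustrative, not the authors' exact configuration):
# train one simple ConvNet per optimizer and record accuracy and loss.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# Example dataset; the paper uses three randomly selected public datasets.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0

def build_convnet():
    # A simple ConvNet: two conv/pool blocks followed by a dense classifier.
    return models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])

# The optimization techniques compared in the paper, with default settings
# except where a variant requires a specific flag (momentum, Nesterov).
candidates = {
    "vSGD": optimizers.SGD(),
    "SGDm": optimizers.SGD(momentum=0.9),
    "SGDm+n": optimizers.SGD(momentum=0.9, nesterov=True),
    "RMSProp": optimizers.RMSprop(),
    "Adam": optimizers.Adam(),
    "AdaGrad": optimizers.Adagrad(),
    "AdaDelta": optimizers.Adadelta(),
    "AdaMax": optimizers.Adamax(),
    "Nadam": optimizers.Nadam(),
}

for name, opt in candidates.items():
    model = build_convnet()
    model.compile(optimizer=opt,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Per-epoch history gives convergence speed; final test metrics give
    # the accuracy and loss used for the overall comparison.
    history = model.fit(x_train, y_train, epochs=5, batch_size=128,
                        validation_data=(x_test, y_test), verbose=0)
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"{name}: test_loss={test_loss:.4f}, test_acc={test_acc:.4f}")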
