Parameter Tuning Using Adaptive Moment Estimation in Deep Learning Neural Networks

The twin issues of loss quality (accuracy) and training time are critical in choosing a stochastic optimizer for training deep neural networks. Optimization methods for machine learning include gradient descent, simulated annealing, genetic algorithms, and second-order techniques such as Newton's method; however, gradient descent remains the most popular method for optimizing neural networks. Over time, researchers have made gradient descent more responsive to the requirements of improved loss quality (accuracy) and reduced training time by progressing from a simple, fixed learning rate to adaptive moment estimation techniques for parameter tuning. In this work, we investigate the performance of established stochastic gradient descent algorithms such as Adam, RMSProp, Adagrad, and Adadelta in terms of training time and loss quality. Using a series of stochastic experiments, we show empirically that adaptive moment estimation has improved the gradient descent optimization method. Based on these empirical outcomes, we recommend further improvement of the method by using higher-order moments of the gradient for parameter tuning (weight update). Our experimental results also indicate that neural network training is a stochastic process.
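To make the adaptive moment estimation (Adam) weight update referred to above concrete, the sketch below shows one Adam step in NumPy. It is a minimal illustration rather than the code used in our experiments; the default hyperparameter values (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly used defaults from Kingma and Ba's Adam paper.

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: update the first and second moment estimates of the
    gradient, correct their initialization bias, and apply the scaled update."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: running uncentered variance
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t is the step count, starting at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter (weight) update
    return w, m, v
```

In contrast to plain gradient descent, which applies a single global learning rate to the raw gradient, the step size here is scaled per parameter by the ratio of the first to the square root of the second moment estimate, which is what makes the method adaptive.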
