Adaptive Learning Rate Adjustment with Short-Term Pre-Training in Data-Parallel Deep Learning

This paper introduces a method that adaptively selects a learning rate (LR) through short-term pre-training (STPT), which is useful for quick model prototyping in data-parallel deep learning. For an unknown model, numerous hyperparameters must be tuned. The proposed method reduces computational time and improves the efficiency of finding an appropriate LR: multiple candidate LRs are evaluated by STPT in data-parallel deep learning, where STPT means training on only the initial iterations of an epoch. When eight LRs are evaluated using eight parallel workers, the proposed method reduces computational time by 87.5% compared with the conventional method. Accuracy is also improved by 4.8% over the conventional method with a reference LR of 0.1; thus, no deterioration in accuracy is observed. For an unknown model, the proposed method exhibits a better training-curve trend than training with fixed LRs.
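To illustrate the selection procedure described above, the sketch below evaluates several candidate LRs by training a fresh model copy for only the first few iterations of an epoch and keeps the LR with the lowest resulting loss. This is a minimal sketch, not the paper's implementation: the tiny fully connected model, the synthetic data, the sequential loop over candidates (the paper assigns one LR to each of eight data-parallel workers and evaluates them concurrently), and the lowest-loss selection criterion are all assumptions introduced here for illustration.

```python
# Minimal sketch of LR selection via short-term pre-training (STPT).
# Assumptions (not from the paper): tiny model, synthetic data, lowest
# short-term loss as the selection criterion, and sequential evaluation of
# candidate LRs (the paper evaluates them concurrently, one per worker).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic classification data standing in for a real dataset such as CIFAR-10.
X = torch.randn(512, 32)
y = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def stpt_loss(lr, num_iters=3):
    """Short-term pre-training: train a fresh model for only the first
    `num_iters` iterations of an epoch and return the last mini-batch loss."""
    model = make_model()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    last_loss = float("inf")
    for i, (xb, yb) in enumerate(loader):
        if i >= num_iters:
            break
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        last_loss = loss.item()
    return last_loss

# Eight candidate LRs, mirroring the eight-worker setting in the paper.
candidate_lrs = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 0.5, 1.0]
losses = {lr: stpt_loss(lr) for lr in candidate_lrs}
best_lr = min(losses, key=losses.get)
print("STPT losses per LR:", losses)
print("Selected LR for full training:", best_lr)
```

In this sketch each candidate is probed in turn; run concurrently with one candidate per worker, the wall-clock cost of the probe phase drops to roughly that of a single short-term run, which is the source of the time saving reported in the abstract.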
