Scaling SGD Batch Size to 32K for ImageNet Training

The most natural way to speed up the training of large networks is to use data parallelism on multiple GPUs. To scale Stochastic Gradient (SG) based methods to more processors, one needs to increase the batch size so that the computational power of each GPU is fully used. However, preserving the network's accuracy as the batch size grows is not trivial. Currently, the state-of-the-art approach is to increase the Learning Rate (LR) proportionally to the batch size and to use a "warm-up" schedule to overcome the initial optimization difficulty. By controlling the LR during training, one can use large batches efficiently for ImageNet training; Batch-1024 for AlexNet and Batch-8192 for ResNet-50 are successful examples. However, for ImageNet-1k training, state-of-the-art AlexNet scales the batch size only to 1024 and ResNet-50 only to 8192, because the learning rate cannot be scaled beyond a certain value. To enable large-batch training for general networks and datasets, we propose Layer-wise Adaptive Rate Scaling (LARS). LARS assigns a different LR to each layer based on the norm of that layer's weights and the norm of its gradients (see the sketch below). Using LARS, we can scale the batch size to 32768 for ResNet-50 and to 8192 for AlexNet. Large batches make full use of the system's computational power: for example, Batch-4096 achieves a 3x speedup over Batch-512 for ImageNet training with the AlexNet model on a DGX-1 station (8 P100 GPUs).
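To make the layer-wise idea concrete, the following is a minimal NumPy sketch of one SGD step with a LARS-style local learning rate. The function name `lars_update`, the `trust_coef` and `weight_decay` defaults, and the omission of momentum are illustrative assumptions rather than the exact recipe from the paper; the point is only that each layer's step is scaled by the ratio of its weight norm to its gradient norm.

```python
import numpy as np

def lars_update(weights, grads, base_lr, trust_coef=0.001, weight_decay=0.0005):
    """One SGD step with a layer-wise adaptive rate (LARS-style sketch).

    For each layer, the local LR is scaled by
        trust_coef * ||w|| / (||g|| + weight_decay * ||w||),
    so layers whose gradients are large relative to their weights
    take proportionally smaller steps.
    """
    updated = []
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Guard against zero norms (e.g., freshly zero-initialized layers).
        if w_norm > 0 and g_norm > 0:
            local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
        else:
            local_lr = 1.0
        updated.append(w - base_lr * local_lr * (g + weight_decay * w))
    return updated

# Toy usage: two "layers" with random weights and gradients.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 4)), rng.standard_normal(4)]
grads = [rng.standard_normal((4, 4)), rng.standard_normal(4)]
weights = lars_update(weights, grads, base_lr=0.01)
```

Because the local rate depends only on per-layer norms, this update composes naturally with the linear LR scaling and warm-up schedule used for large batches: the global `base_lr` follows the schedule, while each layer's effective step size is kept in proportion to its own weight magnitude.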
