Paper information - ImageNet Training by CPU: AlexNet in 11 Minutes and ResNet-50 in 48 Minutes

Since its creation, the ImageNet-1k benchmark set has played a significant role in ascertaining the accuracy of different deep neural network (DNN) models on the classification problem. In recent years it has also served as the principal benchmark for assessing different approaches to DNN training. Finishing 90-epoch ImageNet-1k training with ResNet-50 on an NVIDIA M40 GPU takes 14 days. This training requires 10^18 single-precision operations in total. On the other hand, the world's current fastest supercomputer can finish 2 × 10^17 single-precision operations per second. If we could make full use of the computing capability of a supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in five seconds. Over the last two years a number of researchers have focused on closing this performance gap by scaling DNN training to larger numbers of processors. The most successful approaches to scaling ImageNet training have used synchronous stochastic gradient descent (SGD). However, to scale synchronous SGD one must also increase the batch size used in each iteration. Thus, for many researchers, the focus on scaling DNN training has translated into a focus on developing approaches that allow the batch size in data-parallel synchronous SGD to increase without losing accuracy over a fixed number of epochs. As a result, the batch size and number of processors successfully utilized have increased from 1K batch/128 processors to 8K batch/256 processors over the last two years. The recently published LARS algorithm increased the batch size further to 32K for some DNN models. Following up on this work, we wished to confirm that LARS could be used to further scale the number of processors efficiently used in DNN training and, as a result, further reduce the total training time.
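The five-second estimate is simply total work divided by peak throughput. The following is a minimal sketch of that arithmetic using only the figures quoted in the abstract; the constant and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract; the names are illustrative.
TOTAL_OPS = 1e18           # single-precision operations for 90-epoch ResNet-50
PEAK_OPS_PER_SEC = 2e17    # quoted rate of the fastest supercomputer
M40_DAYS = 14              # quoted run time on one NVIDIA M40 GPU


def ideal_seconds(total_ops: float, ops_per_sec: float) -> float:
    """Lower bound on run time if the machine ran at the quoted rate throughout."""
    return total_ops / ops_per_sec


def sustained_ops_per_sec(total_ops: float, seconds: float) -> float:
    """Average throughput implied by a measured run time."""
    return total_ops / seconds


ideal = ideal_seconds(TOTAL_OPS, PEAK_OPS_PER_SEC)
m40_seconds = M40_DAYS * 24 * 3600
m40_rate = sustained_ops_per_sec(TOTAL_OPS, m40_seconds)

print(f"ideal time at the quoted rate: {ideal:.0f} s")
print(f"M40 run time: {m40_seconds} s")
print(f"throughput implied by the M40 run: {m40_rate:.2e} ops/s")
print(f"M40 run time / ideal time: {m40_seconds / ideal:.0f}x")
```

Under these figures the 14-day run is roughly 240,000 times longer than the ideal, which is the gap that scaling to more processors tries to close.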
In this paper we present the results of this investigation: using LARS, we were able to efficiently utilize 1024 CPUs to finish 100-epoch ImageNet training with AlexNet in 11 minutes, and we finished 90-epoch ImageNet training with ResNet-50 in 48 minutes (batch size = 32K). Furthermore, when we increase the batch size above 20K, our accuracy is much higher than Facebook's at corresponding batch sizes (Figure 1). Our source code is available upon request and will also be released in Intel Caffe.

Figure 1: Because we use weaker data augmentation, our baseline's accuracy is slightly lower than Facebook's (75.4% vs. Facebook's 76.2%). However, at very large batch sizes our accuracy is much higher than Facebook's. Facebook's accuracy is taken from their own report (Goyal et al. 2017). Our accuracy scaling efficiency is much higher than Facebook's.
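The abstract names LARS but does not state its update rule. The sketch below follows the layer-wise rate as commonly described in the LARS literature (each layer's step is scaled by the ratio of its weight norm to its gradient norm); it is not taken from the authors' Intel Caffe code. The names `trust_coef` and `weight_decay`, their default values, and the zero-norm fallback are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def l2_norm(values: List[float]) -> float:
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights: List[float], grads: List[float],
                  trust_coef: float = 0.001,
                  weight_decay: float = 0.0005) -> float:
    """Layer-wise rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = l2_norm(weights)
    denom = l2_norm(grads) + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        return 1.0  # assumption of this sketch: fall back to the global rate
    return trust_coef * w_norm / denom


def lars_step(weights: List[float], grads: List[float], velocity: List[float],
              global_lr: float, momentum: float = 0.9,
              trust_coef: float = 0.001,
              weight_decay: float = 0.0005) -> Tuple[List[float], List[float]]:
    """One momentum-SGD step for a single layer, scaled by the layer-wise rate."""
    local_lr = lars_local_lr(weights, grads, trust_coef, weight_decay)
    new_velocity = [
        momentum * v + global_lr * local_lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w - v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Two hypothetical layers with equal weights but gradients 100x apart in scale.
layer_weights = [3.0, 4.0]
for layer_grads in ([0.6, 0.8], [60.0, 80.0]):
    updated, _ = lars_step(layer_weights, layer_grads, [0.0, 0.0],
                           global_lr=1.0, weight_decay=0.0)
    print(layer_grads, "->", updated)
```

With weight decay set to zero, both layers move by the same amount even though their gradients differ by a factor of 100. Normalizing the step per layer in this way is the usual motivation given for using LARS with large global learning rates at large batch sizes.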

[1] Pradeep Dubey, et al. Distributed Deep Learning Using Synchronous Stochastic Gradient Descent, 2016, ArXiv.

[2] Forrest N. Iandola, et al. How to scale distributed deep learning?, 2016, ArXiv.

[3] Ioannis Mitliagkas, et al. Asynchrony begets momentum, with an application to deep learning, 2016, 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[4] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.

[5] Dong Yu, et al. On parallelizability of stochastic gradient descent for speech DNNS, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[7] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Marc'Aurelio Ranzato, et al. Building high-level features using large scale unsupervised learning, 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9] Mu Li. Proposal Scaling Distributed Machine Learning with System and Algorithm Co-design, 2016.

[10] Yann LeCun, et al. Deep learning with Elastic Averaging SGD, 2014, NIPS.

[11] Trevor Darrell, et al. Caffe: Convolutional Architecture for Fast Feature Embedding, 2014, ACM Multimedia.

[12] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[13] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.

[14] Chong Wang, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, 2015, ICML.

[15] Alex Krizhevsky, et al. One weird trick for parallelizing convolutional neural networks, 2014, ArXiv.

[16] Anthony Skjellum, et al. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, 1996, Parallel Comput.

[17] Tao Wang, et al. Deep learning with COTS HPC systems, 2013, ICML.

[18] Forrest N. Iandola, et al. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, ArXiv.

[20] Dong Yu, et al. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, 2014, INTERSPEECH.

[21] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[22] Samy Bengio, et al. Revisiting Distributed Synchronous SGD, 2016, ArXiv.