A Quick Survey on Large-Scale Distributed Deep Learning Systems

Deep learning has been widely adopted across many fields and has come to play a major role in them. As it gradually penetrates these fields, the amount of data per application grows tremendously, and so do the computational complexity and the number of model parameters. An obvious consequence is that training and inference become time-consuming: for example, training a classic ResNet-50 classification model on the ImageNet dataset takes about 14 days on a single NVIDIA M40 GPU. Distributed acceleration is therefore a very useful way to dispatch the computation of training, and even inference, across many nodes in parallel and speed up the whole process. Work from Facebook and from UC Berkeley can train the ResNet-50 model within an hour and within minutes, respectively, using distributed deep learning algorithms and systems. Like other distributed acceleration efforts, this makes it possible to shrink the training of large models on large datasets from weeks to minutes, giving researchers and developers more room to explore and experiment. Beyond raw acceleration, however, what other issues does a distributed deep learning system face? Where is the upper limit of acceleration? Which applications will acceleration serve? What is its price and cost? In this paper, we take a quick survey of distributed deep learning systems from the algorithm, distributed-system, and application perspectives. We present several recent notable works and analyze the limitations and prospects of distributed methods.
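To make the dispatch-across-nodes idea concrete, below is a minimal, framework-agnostic sketch of synchronous data-parallel SGD, the scheme underlying large-minibatch results such as Facebook's [10] and Berkeley's [6]: every worker holds a model replica, computes gradients on its own shard of the global mini-batch, and an all-reduce averages the gradients before each update so the replicas stay in sync. The toy linear model, worker count, and learning rate are illustrative assumptions, not details drawn from any of the surveyed systems.

```python
import numpy as np

# Illustrative constants (assumptions, not from any surveyed system).
NUM_WORKERS = 4
LR = 0.1

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))   # one global mini-batch of inputs
true_w = rng.normal(size=8)
y = X @ true_w                  # targets for a toy linear model
w = np.zeros(8)                 # model parameters, replicated on every worker

def local_gradient(w, X_shard, y_shard):
    """Least-squares gradient computed on one worker's shard of the batch."""
    err = X_shard @ w - y_shard
    return 2.0 * X_shard.T @ err / len(y_shard)

for step in range(100):
    # Each worker computes a gradient on its own shard of the global batch.
    shards = zip(np.array_split(X, NUM_WORKERS), np.array_split(y, NUM_WORKERS))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    # "All-reduce": average the per-worker gradients; with equal-size shards
    # this equals the full-batch gradient. Every replica then applies the
    # same update, so all copies of w remain identical.
    g = np.mean(grads, axis=0)
    w -= LR * g

print("parameter error:", np.linalg.norm(w - true_w))
```

In a real system the averaging step is a network collective (e.g. ring all-reduce as in Horovod [2]) or a pull/push against a parameter server [7][27], and the shards are read from disjoint partitions of the dataset rather than sliced from one array.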

[1] Sam Ade Jacobs, et al. Towards Scalable Parallel Training of Deep Neural Networks, 2017, MLHPC@SC.

[2] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, arXiv.

[3] Hanqing Lu, et al. Recent advances in efficient computation of deep convolutional neural networks, 2018, Frontiers of Information Technology & Electronic Engineering.

[4] Vijay Vasudevan, et al. Learning Transferable Architectures for Scalable Image Recognition, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5] Vivienne Sze, et al. Efficient Processing of Deep Neural Networks: A Tutorial and Survey, 2017, Proceedings of the IEEE.

[6] James Demmel, et al. ImageNet Training in Minutes, 2017, ICPP.

[7] Alexander J. Smola, et al. Communication Efficient Distributed Machine Learning with the Parameter Server, 2014, NIPS.

[8] Hao Wu, et al. Mixed Precision Training, 2017, ICLR.

[9] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[10] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.

[11] Eric P. Xing, et al. High-Performance Distributed ML at Scale through Parameter Server Consistency Models, 2014, AAAI.

[12] Seunghak Lee, et al. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, 2013, NIPS.

[13] Samy Bengio, et al. Device Placement Optimization with Reinforcement Learning, 2017, ICML.

[14] Quoc V. Le, et al. Neural Architecture Search with Reinforcement Learning, 2016, ICLR.

[15] Yang You, et al. Large Batch Training of Convolutional Networks, 2017, arXiv:1708.03888.

[16] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[17] J. Demmel, et al. ImageNet Training in 24 Minutes, 2017.

[18] Shengen Yan, et al. Deep Image: Scaling up Image Recognition, 2015, arXiv.

[19] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Takuya Akiba, et al. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes, 2017, arXiv.

[21] Yuanzhou Yang, et al. Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes, 2018, arXiv.

[22] David M. Brooks, et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective, 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[23] Yuning Jiang, et al. MegDet: A Large Mini-Batch Object Detector, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24] Yaoliang Yu, et al. Petuum: A New Platform for Distributed Machine Learning on Big Data, 2015, IEEE Trans. Big Data.

[25] Quoc V. Le, et al. Efficient Neural Architecture Search via Parameter Sharing, 2018, ICML.

[26] Bo Chen, et al. MnasNet: Platform-Aware Neural Architecture Search for Mobile, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Aaron Q. Li, et al. Parameter Server for Distributed Machine Learning, 2013.