Model-Aware Parallelization Strategy for Deep Neural Networks' Distributed Training

Parallelizing the training of deep neural networks (DNNs) across distributed clusters is now standard practice. The most common parallelization strategy is data parallelism, but its speedup is limited by heavy communication overhead. To alleviate this problem, we employ hybrid parallelism, which reduces the amount of data that must be transferred. Hybrid parallelism divides the machines into groups: machines in the same group hold different parts of the model and carry out training collaboratively, while all groups compute simultaneously. Building on hybrid parallelism, we propose a Model-aware Parallelization Strategy that aims to complete training as quickly as possible. To select a concrete strategy, we formulate a combinatorial optimization problem and solve it with a greedy heuristic that decides how to group the machines and how to partition the network. We evaluate the Model-aware Parallelization Strategy on six widely used deep neural networks. Experimental results show that, for large networks such as VGG, the Model-aware Parallelization Strategy reduces completion time by more than 20 percent.
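
The abstract frames the choice of group size and layer partition as a combinatorial optimization problem solved with a greedy heuristic. Below is a minimal, hypothetical sketch of how such a heuristic might look under a simple analytical cost model (per-iteration compute time plus communication time); the constants, function names, and partitioning rule are illustrative assumptions and not the paper's actual formulation.

```python
# Sketch only: greedy choice of machine grouping and layer partition for
# hybrid parallelism. BANDWIDTH, COMPUTE_RATE, and all cost inputs are
# hypothetical placeholders, not values or formulas from the paper.

BANDWIDTH = 10e9      # assumed network bandwidth, bytes/s
COMPUTE_RATE = 5e12   # assumed per-machine throughput, FLOP/s

def greedy_layer_partition(layer_flops, group_size):
    """Greedily assign consecutive layers to the machines of one group
    so that per-machine compute is roughly balanced."""
    target = sum(layer_flops) / group_size
    parts, current, load = [], [], 0.0
    for i, cost in enumerate(layer_flops):
        current.append(i)
        load += cost
        if load >= target and len(parts) < group_size - 1:
            parts.append(current)
            current, load = [], 0.0
    parts.append(current)
    return parts

def estimate_iteration_time(layer_flops, layer_params, layer_activations,
                            num_machines, group_size):
    """Rough per-iteration time: slowest machine's compute, plus intra-group
    activation transfers, plus inter-group gradient synchronization."""
    parts = greedy_layer_partition(layer_flops, group_size)
    compute = max(sum(layer_flops[i] for i in p) for p in parts) / COMPUTE_RATE
    # activations crossing each partition boundary inside a group
    intra = sum(layer_activations[p[-1]] for p in parts[:-1]) / BANDWIDTH
    # data-parallel groups exchange gradients for the parameters they hold
    num_groups = num_machines // group_size
    inter = 0.0 if num_groups == 1 else sum(layer_params) / BANDWIDTH
    return compute + intra + inter

def choose_strategy(layer_flops, layer_params, layer_activations, num_machines):
    """Enumerate group sizes that evenly divide the cluster and keep the
    grouping/partition pair with the lowest estimated iteration time."""
    best = None
    for g in range(1, num_machines + 1):
        if num_machines % g:
            continue
        t = estimate_iteration_time(layer_flops, layer_params,
                                    layer_activations, num_machines, g)
        if best is None or t < best[0]:
            best = (t, g, greedy_layer_partition(layer_flops, g))
    return best  # (estimated time, group size, layer partition)
```

A group size of 1 degenerates to pure data parallelism, while a group size equal to the cluster size is pure model parallelism; the enumeration above simply picks whichever hybrid point the cost model estimates to be fastest.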
