Heterogeneous model parallelism for deep neural networks

Abstract Deep neural networks (DNNs) have transformed computer vision, establishing themselves as the current state of the art for image processing. Nevertheless, training today's large DNN models remains one of the main challenges to be solved. Data parallelism has been the most widespread distributed training strategy, since it is easy to program and applies to almost every case; however, it suffers from several limitations, such as high communication requirements and memory constraints when training very large models. Model parallelism has been proposed to overcome these limitations, solving the most substantial problems of the former strategy. However, describing and implementing the parallel training of a DNN model across a set of processes deployed on several devices is a challenging task. Existing solutions assume a homogeneous distribution of the work, which is impractical when working with devices of different computational capabilities, a common situation on high-performance computing (HPC) platforms. To address these shortcomings, this work proposes a novel model-parallelism technique for heterogeneous platforms, implementing a load-balancing mechanism between uneven devices of an HPC platform. Our proposal builds on Google Brain's Mesh-TensorFlow for convolutional networks, splitting tensors across the filter dimension in order to balance the computational load of the available devices. The conducted experiments show an improvement in the exploitation of heterogeneous computational resources, enhancing training performance. The code is available at: https://github.com/mhaut/HeterogeneusModelDNN.
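As a rough illustration of the kind of filter-dimension sharding the abstract describes, the sketch below maps the output-channel ("filters") dimension of a convolutional layer onto a mesh of two GPUs using the public Mesh-TensorFlow API. It is not the authors' implementation: the dimension sizes, device names, and the even two-way mesh are illustrative assumptions, and the heterogeneous (uneven, capability-aware) split proposed in the paper is not part of stock Mesh-TensorFlow.

```python
# Minimal sketch: shard the "filters" dimension of a conv layer across two GPUs
# with Mesh-TensorFlow. All sizes/devices below are assumptions for illustration.
import tensorflow.compat.v1 as tf
import mesh_tensorflow as mtf

tf.disable_eager_execution()

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "conv_mesh")

# Named dimensions: "filters" is the one that will be split across devices.
batch_dim = mtf.Dimension("batch", 32)
rows_dim = mtf.Dimension("rows", 28)
cols_dim = mtf.Dimension("cols", 28)
channels_dim = mtf.Dimension("channels", 1)
filters_dim = mtf.Dimension("filters", 64)
classes_dim = mtf.Dimension("classes", 10)

tf_images = tf.placeholder(tf.float32, [32, 28, 28, 1])
x = mtf.import_tf_tensor(
    mesh, tf_images,
    shape=mtf.Shape([batch_dim, rows_dim, cols_dim, channels_dim]))

# Convolution whose output-channel ("filters") dimension gets sharded.
h = mtf.layers.conv2d(x, output_dim=filters_dim,
                      filter_size=(3, 3), strides=(1, 1), padding="SAME")
h = mtf.relu(h)
# Global average pooling over the spatial dimensions, then a classifier head.
h = mtf.reduce_mean(h, output_shape=mtf.Shape([batch_dim, filters_dim]))
logits = mtf.layers.dense(h, classes_dim, name="logits")

# Layout: place the "filters" dimension along a 2-way mesh of (here, equally
# capable) GPUs. The paper's contribution is balancing this split when the
# devices have different computational capabilities.
devices = ["gpu:0", "gpu:1"]
mesh_shape = mtf.convert_to_shape("devices:2")
layout_rules = mtf.convert_to_layout_rules("filters:devices")
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)

# Lower the Mesh-TensorFlow graph to per-device TensorFlow ops.
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_logits = lowering.export_to_tf_tensor(logits)
```

With this layout, each GPU holds half of the convolution's filters; the heterogeneous variant described in the abstract would instead assign a share of the filter dimension proportional to each device's measured capability.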
