Accelerating neural network training with distributed asynchronous and selective optimization (DASO)

With increasing data and model complexities, the time required to train neural networks has become prohibitively large. To address the exponential rise in training time, users are turning to data parallel neural networks (DPNN) and large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations after each forward-backward pass. This synchronization is the central algorithmic bottleneck. We introduce the Distributed Asynchronous and Selective Optimization (DASO) method, which leverages multi-GPU compute node architectures to accelerate network training while maintaining accuracy. DASO uses a hierarchical and asynchronous communication scheme comprising node-local and global networks, while adjusting the global synchronization rate during the learning process. We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks, as compared to current optimized data parallel training methods.
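To make the hierarchical, selective synchronization idea concrete, the following is a minimal sketch in PyTorch, not the authors' implementation: gradients are averaged within each multi-GPU node after every step over the fast intra-node interconnect, and only every few steps are they additionally averaged across nodes via one leader rank per node. The helper names (`build_groups`, `hierarchical_sync`) and the fixed `global_sync_every` interval are illustrative assumptions; for brevity the inter-node step is shown as blocking, whereas DASO overlaps it asynchronously with training and adapts the synchronization rate over the course of learning.

```python
import torch.distributed as dist

def build_groups(gpus_per_node):
    """Create one intra-node process group per node plus a group of node leaders."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    node_id = rank // gpus_per_node

    local_group = None
    for n in range(world_size // gpus_per_node):
        ranks = list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
        group = dist.new_group(ranks)          # every rank must enter this call
        if n == node_id:
            local_group = group

    leader_ranks = list(range(0, world_size, gpus_per_node))
    leader_group = dist.new_group(leader_ranks)
    leader_rank = node_id * gpus_per_node      # global rank of this node's leader
    return local_group, leader_group, leader_rank

def hierarchical_sync(model, local_group, leader_group, leader_rank,
                      step, global_sync_every=4):
    """Average gradients within the node every step, across nodes only periodically."""
    local_size = dist.get_world_size(group=local_group)
    do_global = step % global_sync_every == 0

    for p in model.parameters():
        if p.grad is None:
            continue
        # 1) node-local averaging over the fast intra-node interconnect (NVLink/PCIe)
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=local_group)
        p.grad /= local_size

        # 2) selective inter-node averaging, performed by the node leaders only
        if do_global and dist.get_rank() == leader_rank:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=leader_group)
            p.grad /= dist.get_world_size(group=leader_group)

        # 3) push the globally averaged gradient back to the node's other workers
        if do_global:
            dist.broadcast(p.grad, src=leader_rank, group=local_group)
```

On the steps between global synchronizations, each node effectively trains with its node-averaged gradients, which is what removes the per-step blocking all-reduce across the full cluster that the abstract identifies as the central bottleneck.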
