Alleviating Bottlenecks for DNN Execution on GPUs via Opportunistic Computing

Edge computing and IoT applications are severely constrained by limited hardware resource. This makes memory-consuming DNN (Deep Neural Network) frameworks not applicable to edge computing. Simple algorithms such as direct convolution are finding their way in embedded machine learning. As one of the most widely used platforms for DNN acceleration, GPUs face the bottleneck of on-chip bandwidth. This work introduces a GPU DNN execution architecture that can relieve the on-chip bandwidth bottleneck by reducing data movement through opportunistic computing. We first investigate data access patterns in the hardware's view. Then we propose two opportunistic computing techniques to predictably perform computation when data is available with the help of assistant warps. By moving computation to data, our techniques are able to significantly reduce data movement and relieve the DNN execution bottleneck. Our evaluation results show that the proposed technique can improve DNN application performance as much as 55%.

[1]  Leibo Liu,et al.  An Efficient Kernel Transformation Architecture for Binary- and Ternary-Weight Neural Network Inference , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[2]  John Tran,et al.  cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[3]  Clément Farabet,et al.  Torch7: A Matlab-like Environment for Machine Learning , 2011, NIPS 2011.

[4]  Boris Murmann,et al.  Memory-Optimal Direct Convolutions for Maximizing Classification Accuracy in Embedded Applications , 2019, ICML.

[5]  Jeff Johnson,et al.  Fast Convolutional Nets With fbfft: A GPU Performance Evaluation , 2014, ICLR.

[6]  Cesare Alippi,et al.  Moving Convolutional Neural Networks to Embedded Systems: The AlexNet and VGG-16 Case , 2018, 2018 17th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN).

[7]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[8]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[9]  Tajana Simunic,et al.  Efficient neural network acceleration on GPGPU using content addressable memory , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[10]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[11]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[12]  Lee-Sup Kim,et al.  A kernel decomposition architecture for binary-weight Convolutional Neural Networks , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[13]  Robert Mullins,et al.  1D-FALCON: Accelerating Deep Convolutional Neural Network Inference by Co-optimization of Models and Underlying Arithmetic Implementation , 2017, ICANN.

[14]  Radu Marculescu,et al.  On-Chip Communication Network for Efficient Training of Deep Convolutional Networks on Heterogeneous Manycore Systems , 2017, IEEE Transactions on Computers.

[15]  Olivier Temam,et al.  Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators , 2014, 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC).

[16]  Alex Krizhevsky,et al.  One weird trick for parallelizing convolutional neural networks , 2014, ArXiv.

[17]  Mahmut T. Kandemir,et al.  Opportunistic Computing in GPU Architectures , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[18]  Tze Meng Low,et al.  High Performance Zero-Memory Overhead Direct Convolutions , 2018, ICML.

[19]  Scott A. Mahlke,et al.  DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).