Performance Evaluation and Optimization of HBM-Enabled GPU for Data-Intensive Applications

Graphics processing units (GPUs) are widely used to accelerate data-intensive applications, and higher GPU memory bandwidth is desirable for such workloads. Traditional graphics double data rate (GDDR) memories achieve higher bandwidth by raising the operating frequency, which leads to excessive power consumption. Recently, a new memory technology called high-bandwidth memory (HBM), based on 3-D die-stacking technology, has been adopted in the latest generation of GPUs; its in-package stacked DRAM provides both high bandwidth and low power consumption. However, the capacity of the integrated in-package stacked memory is limited (e.g., only 4 GB on the state-of-the-art HBM-enabled GPU, the AMD Radeon Fury X). In this paper, we implement two representative data-intensive applications, convolutional neural network (CNN) training and breadth-first search (BFS), on an HBM-enabled GPU to evaluate the improvement brought by the adoption of HBM and to investigate techniques that fully unleash the benefits of such a GPU. Based on the evaluation results, we first propose a software pipeline that alleviates the capacity limitation of the HBM for CNN training. We then design two programming techniques that improve the utilization of memory bandwidth for the BFS application. Experimental results demonstrate that our pipelined CNN training achieves a 1.63× speedup on an HBM-enabled GPU compared with the best high-performance GPU on the market, and that the two optimization techniques make the BFS algorithm up to 24.5× faster (9.8× and 2.5× from each technique, respectively) than conventional implementations.
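The abstract does not describe the internals of the proposed software pipeline; as a minimal sketch of the general idea it relies on, the CUDA example below overlaps host-to-device transfers of layer-sized data chunks with kernel execution using two streams and double buffering, so that the device-resident working set stays small. All names and parameters here (dummy_layer_kernel, kChunks, kElems) are illustrative assumptions, not the paper's actual implementation.

// Hypothetical sketch: double-buffered pipeline overlapping copies and compute.
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

__global__ void dummy_layer_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;   // stand-in for real per-layer compute
}

int main() {
    const int kChunks = 8;              // number of layer-sized chunks streamed through the GPU
    const int kElems  = 1 << 20;        // elements per chunk
    size_t bytes = kElems * sizeof(float);

    float *h_buf;                       // pinned host staging buffer holding all chunks
    CHECK(cudaMallocHost(&h_buf, kChunks * bytes));
    for (int i = 0; i < kChunks * kElems; ++i) h_buf[i] = 1.0f;

    float *d_buf[2];                    // two device buffers: one computes while the other fills
    cudaStream_t stream[2];
    for (int b = 0; b < 2; ++b) {
        CHECK(cudaMalloc(&d_buf[b], bytes));
        CHECK(cudaStreamCreate(&stream[b]));
    }

    // Pipeline loop: chunk c's copy and kernel are queued on stream c%2,
    // so the copy of chunk c+1 overlaps with the kernel of chunk c.
    for (int c = 0; c < kChunks; ++c) {
        int b = c & 1;
        CHECK(cudaMemcpyAsync(d_buf[b], h_buf + (size_t)c * kElems, bytes,
                              cudaMemcpyHostToDevice, stream[b]));
        dummy_layer_kernel<<<(kElems + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], kElems);
    }
    CHECK(cudaDeviceSynchronize());

    for (int b = 0; b < 2; ++b) { cudaFree(d_buf[b]); cudaStreamDestroy(stream[b]); }
    cudaFreeHost(h_buf);
    return 0;
}

Because only two chunk-sized device buffers are live at any time, transfer latency is hidden behind compute while the device footprint stays bounded, which is analogous to how a layer-wise CNN training pipeline could keep its working set within a 4 GB HBM capacity.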
