fuseGNN: Accelerating Graph Convolutional Neural Network Training on GPGPU

Graph convolutional neural networks (GNN) have achieved state-of-the-art performance on tasks like node classification. It has become a new workload family member in data-centers. GNN works on irregular graph-structured data with three distinct phases: Combination, Graph Processing, and Aggregation. While Combination phase has been well supported by sgemm kernels in cuBLAS, the other two phases are still inefficient on GPGPU due to the lack of optimized CUDA kernels. In particular, Aggregation phase introduces large volume of DRAM storage footprint and data movement, and both Aggregation and Graph Processing phases suffer from high kernel launching time. These inefficiencies not only decrease training throughput but also limit users from training GNNs on larger graphs on GPGPU. Although these problems have been partially alleviated by recent studies, their optimizations are still not sufficient. In this paper, we propose fuseGNN, an extension of PyTorch that provides highly optimized APIs and CUDA kernels for GNN. First, two different programming abstractions for Aggregation phase are utilized to handle graphs with different average degrees. Second, dedicated GPGPU kernels are developed for Aggregation and Graph Processing in both forward and backward passes, in which kernel-fusion along with other optimization strategies are applied to reduce kernel launching time and latency as well as exploit data reuse opportunities. Evaluation on multiple benchmarks shows that fuseGNN achieves up to 5.3× end-to-end speedup over state-of-the-art frameworks, and the DRAM storage footprint is reduced by several orders of magnitude on large datasets.

[1]  Marco Maggioni,et al.  Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking , 2018, ArXiv.

[2]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[3]  Jure Leskovec,et al.  How Powerful are Graph Neural Networks? , 2018, ICLR.

[4]  Tarek S. Abdelrahman,et al.  Reducing divergence in GPGPU programs with loop merging , 2013, GPGPU@ASPLOS.

[5]  Kilian Q. Weinberger,et al.  Simplifying Graph Convolutional Networks , 2019, ICML.

[6]  Alexander J. Smola,et al.  Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs , 2019, ArXiv.

[7]  Martin D. F. Wong,et al.  Accelerating aerial image simulation with GPU , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[8]  Yangdong Deng,et al.  Taming irregular EDA applications on GPUs , 2009, 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers.

[9]  Bernard Ghanem,et al.  DeepGCNs: Making GCNs Go as Deep as CNNs , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Ludek Matyska,et al.  Optimizing CUDA code by kernel fusion: application on BLAS , 2013, The Journal of Supercomputing.

[11]  Wei Yi,et al.  Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU , 2010, 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing.

[12]  Dongrui Fan,et al.  HyGCN: A GCN Accelerator with Hybrid Architecture , 2020, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[13]  Jure Leskovec,et al.  Inductive Representation Learning on Large Graphs , 2017, NIPS.

[14]  Chong Wang,et al.  Attention-based Graph Neural Network for Semi-supervised Learning , 2018, ArXiv.

[15]  Yangdong Deng,et al.  Accelerating RTL simulation with GPUs , 2011, 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[16]  Jan Eric Lenssen,et al.  Fast Graph Representation Learning with PyTorch Geometric , 2019, ArXiv.

[17]  Yafei Dai,et al.  NeuGraph: Parallel Deep Neural Network Computation on Large Graphs , 2019, USENIX ATC.

[18]  Dongrui Fan,et al.  Characterizing and Understanding GCNs on GPU , 2020, IEEE Computer Architecture Letters.