HyGCN: A GCN Accelerator with Hybrid Architecture

Inspired by the great success of neural networks, graph convolutional neural networks (GCNs) have been proposed to analyze graph data. GCNs mainly comprise two phases with distinct execution patterns. The Aggregation phase behaves like graph processing, exhibiting a dynamic and irregular execution pattern. The Combination phase acts more like a neural network, presenting a static and regular execution pattern. These hybrid execution patterns require a design that alleviates irregularity and exploits regularity. Moreover, to achieve higher performance and energy efficiency, the design needs to leverage the high intra-vertex parallelism in the Aggregation phase, the highly reusable inter-vertex data in the Combination phase, and the opportunity to fuse phase-by-phase execution introduced by the new features of GCNs. However, existing architectures fail to address these demands. In this work, we first characterize the hybrid execution patterns of GCNs on an Intel Xeon CPU. Guided by this characterization, we design HyGCN, a GCN accelerator with a hybrid architecture that executes GCNs efficiently. Specifically, we first build a new programming model that exposes fine-grained parallelism to our hardware design. Second, we propose a hardware design with two efficient processing engines that alleviate the irregularity of the Aggregation phase and leverage the regularity of the Combination phase; these engines exploit multiple forms of parallelism and reuse data efficiently. Third, we optimize the overall system with an inter-engine pipeline for inter-phase fusion and priority-based coordination of off-chip memory accesses to improve bandwidth utilization. Compared to the state-of-the-art software framework running on an Intel Xeon CPU and an NVIDIA V100 GPU, our design achieves on average a 1509× speedup with 2500× energy reduction and a 6.5× speedup with 10× energy reduction, respectively.
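
To make the two phases concrete, the sketch below implements one GCN layer in plain NumPy, separating the irregular, graph-processing-like Aggregation step from the regular, dense Combination step. This is a minimal illustration of the computation pattern the abstract describes, not HyGCN's actual programming model; the function name, the edge-list format, and the sum aggregator are all assumptions made for the example.

```python
import numpy as np

def gcn_layer(edges, features, weight):
    """One GCN layer split into its two phases (illustrative sketch).

    edges    : (E, 2) int array of (src, dst) vertex pairs  -- graph structure
    features : (V, F_in) float array of vertex features
    weight   : (F_in, F_out) float array of layer weights
    """
    # --- Aggregation phase: dynamic and irregular ----------------------
    # Each vertex gathers its neighbors' feature vectors. Memory accesses
    # are dictated by the edge list, so they are data-dependent and hard
    # to predict, much like classic graph processing. A sum aggregator is
    # assumed here for simplicity.
    aggregated = np.zeros_like(features)
    for src, dst in edges:
        aggregated[dst] += features[src]  # edge-driven, irregular access

    # --- Combination phase: static and regular -------------------------
    # Every vertex applies the same dense matrix multiply and activation,
    # so the access pattern is fixed and the weight matrix is reused
    # across all vertices, like a conventional neural-network layer.
    return np.maximum(aggregated @ weight, 0.0)  # ReLU activation
```

The contrast is visible in the loop structure: the Aggregation loop's access pattern is determined at run time by the edges, which favors a latency-tolerant, parallel gather engine, while the Combination step is a fixed dense multiply whose weights are shared by every vertex, which favors a regular engine with heavy weight reuse. This split is exactly the behavior the two processing engines in the hybrid architecture target.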
