FPGAN: An FPGA Accelerator for Graph Attention Networks With Software and Hardware Co-Optimization

Graph Attention Networks (GATs) achieve outstanding performance on multiple authoritative node classification benchmarks, both transductive and inductive. This work presents FPGAN, an FPGA-based accelerator for GATs that delivers significant improvements in performance and energy efficiency over a PyTorch baseline without losing accuracy. FPGAN eliminates the dependence on digital signal processors (DSPs) and large amounts of on-chip memory, so it runs well even on low-end FPGA devices. We design FPGAN with software and hardware co-optimization across the full stack, from algorithm to architecture. Specifically, we compress the model to reduce its size, quantize features to enable fixed-point computation, replace multiply-accumulate (MAC) units with shift-addition units (SAUs) to remove the need for DSPs, and design an efficient algorithm to approximate the softmax function. We also adjust the activation functions and fuse operations to further reduce the computational cost. Moreover, all data are vectorized and aligned for scalable vector computation and efficient memory access. All of these optimizations are integrated into a universal hardware pipeline that supports various GAT structures. We evaluate our design on an Inspur F10A board with an Intel Arria 10 GX1150 FPGA and 16 GB of DDR3 memory. Experimental results show that, while maintaining accuracy, FPGAN achieves a 7.34x speedup over an Nvidia Tesla V100 GPU and 593x over an Intel Xeon Gold 5115 CPU, with 48x and 2400x higher energy efficiency, respectively.
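The abstract states that FPGAN replaces MAC units with shift-addition units (SAUs) so that inference needs no DSP blocks, but it does not describe the SAU datapath. A common way to realize this is to constrain quantized weights to signed powers of two, so each multiplication degenerates into a barrel shift followed by an add or subtract. The C++ sketch below illustrates that general idea on a small fixed-point dot product; the `Po2Weight` encoding and `sau_mac` helper are assumptions for illustration, not FPGAN's actual interface.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical encoding of a weight quantized to a signed power of two:
// value = sign * 2^exponent. This is an illustrative assumption only.
struct Po2Weight {
    int8_t sign;      // +1 or -1
    int8_t exponent;  // shift amount
};

// One "shift-addition" step: instead of acc += w * x (a DSP multiply),
// the product is formed with a barrel shift and accumulated with an add/subtract.
static inline int32_t sau_mac(int32_t acc, Po2Weight w, int16_t x) {
    int32_t shifted = static_cast<int32_t>(x) << w.exponent;  // replaces the multiplier
    return (w.sign >= 0) ? acc + shifted : acc - shifted;     // add or subtract only
}

int main() {
    // Dot product of a fixed-point feature vector with power-of-two weights.
    int16_t   features[4] = {100, -25, 7, 64};
    Po2Weight weights[4]  = {{+1, 1}, {-1, 3}, {+1, 0}, {-1, 2}};  // 2, -8, 1, -4

    int32_t acc = 0;
    for (int i = 0; i < 4; ++i)
        acc = sau_mac(acc, weights[i], features[i]);

    // acc == 100*2 + (-25)*(-8) + 7*1 + 64*(-4) = 200 + 200 + 7 - 256 = 151
    return (acc == 151) ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

On an FPGA, a constant shift costs only wiring, and a variable shift maps to a small barrel shifter built from LUTs, which is why this style of unit avoids DSP slices entirely.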
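The abstract also mentions an efficient approximation of the softmax used to normalize attention coefficients, without detailing the method. Hardware softmax units commonly subtract the row maximum for range reduction and replace e^x with 2^(x·log2 e), since a power of two reduces to an integer shift plus a small correction for the fractional part. The sketch below is a fixed-point (Q8.8) illustration of that general technique; the format, constants, and function names are assumptions, not FPGAN's published design.

```cpp
#include <cstdint>
#include <vector>

// Assumed fixed-point format for this sketch: Q8.8 (8 integer, 8 fractional bits).
constexpr int     FRAC_BITS = 8;
constexpr int32_t ONE       = 1 << FRAC_BITS;

// Approximate 2^x for a Q8.8 input: integer part becomes a shift,
// fractional part f in [0, 1) uses the linear approximation 2^f ~= 1 + f.
static int32_t approx_exp2(int32_t x) {
    int32_t ipart = x >> FRAC_BITS;    // floor of the integer part
    int32_t fpart = x & (ONE - 1);     // fractional bits
    int32_t base  = ONE + fpart;       // ~2^f
    if (ipart >= 0)   return base << ipart;
    if (ipart <= -16) return 0;        // underflow: contribution is negligible
    return base >> (-ipart);
}

// Softmax over attention scores: subtract the max, replace e^x by 2^(x*log2 e),
// then normalize. log2(e) ~= 1.4427 is stored as the Q8.8 constant 369.
std::vector<int32_t> approx_softmax(const std::vector<int32_t>& scores) {
    int32_t maxv = scores[0];
    for (int32_t s : scores) maxv = (s > maxv) ? s : maxv;

    std::vector<int32_t> num(scores.size());
    int64_t denom = 0;
    for (size_t i = 0; i < scores.size(); ++i) {
        int32_t scaled = static_cast<int32_t>(
            (static_cast<int64_t>(scores[i] - maxv) * 369) >> FRAC_BITS);
        num[i] = approx_exp2(scaled);
        denom += num[i];
    }
    for (int32_t& n : num)
        n = static_cast<int32_t>((static_cast<int64_t>(n) << FRAC_BITS) / denom);
    return num;  // Q8.8 probabilities summing to roughly 1.0 (i.e., 256)
}
```

The only multiplication left is by the constant 369 (log2 e in Q8.8), which synthesis tools fold into a few shifts and adds, so a unit in this style stays DSP-free.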
