Slice-and-Forge: Making Better Use of Caches for Graph Convolutional Network Accelerators

Graph convolutional networks (GCNs) are becoming increasingly popular because they can process a wide variety of data formats that prior deep neural networks cannot easily support. A key challenge in designing hardware accelerators for GCNs is the sheer size and randomness of their data access patterns, which greatly reduce the effectiveness of the limited on-chip cache. To improve cache effectiveness by mitigating these irregular accesses, prior studies often employ the vertex tiling techniques used in traditional graph processing. While effective at improving cache efficiency, these approaches are sensitive to the tiling configuration, whose optimal setting depends heavily on the target input dataset. Furthermore, existing solutions require manual trial-and-error tuning or rely on sub-optimal analytical models. In this paper, we propose Slice-and-Forge (SnF), an efficient hardware accelerator for GCNs that greatly improves the effectiveness of the limited on-chip cache. SnF adopts a tiling strategy named feature slicing, which splits the feature matrix into vertical slices and processes them in the outermost loop of the execution. This choice causes an identical computational pattern over the irregular graph data to repeat across multiple rounds, and SnF exploits these repetitions to dynamically tune its tile size at runtime. Our experimental results show that SnF achieves 1.73× higher geomean performance than prior work in multi-engine settings and 1.46× higher geomean performance in small-scale settings, without the need for offline analyses.
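To make the feature-slicing idea concrete, below is a minimal software sketch of the GCN aggregation step A·X computed one vertical feature slice at a time, with slicing as the outermost loop. This is an illustrative sketch, not the paper's implementation: the function name, the fixed slice width, and the use of SciPy are assumptions introduced here, whereas SnF realizes this loop structure in hardware and selects the slice width dynamically from the behavior observed in earlier rounds.

```python
# Sketch (not the authors' implementation) of feature slicing for the GCN
# aggregation step adj @ feats, where adj is the sparse adjacency matrix and
# feats is the dense node-feature matrix. Slices of feats are processed in the
# outermost loop, so the irregular traversal of adj repeats once per slice.
import numpy as np
import scipy.sparse as sp

def aggregate_feature_sliced(adj: sp.csr_matrix, feats: np.ndarray,
                             tile: int) -> np.ndarray:
    """Compute adj @ feats one vertical feature slice (of width `tile`) at a time."""
    num_nodes, feat_dim = feats.shape
    out = np.empty((num_nodes, feat_dim), dtype=feats.dtype)
    # Outermost loop over feature slices: each round re-walks the same
    # irregular adjacency structure but touches only `tile` columns of feats,
    # shrinking the per-round working set so it can fit in an on-chip cache.
    for start in range(0, feat_dim, tile):
        end = min(start + tile, feat_dim)
        out[:, start:end] = adj @ feats[:, start:end]
    return out

# Example: random graph, 256-wide features, slice width 32 (a hypothetical
# fixed value; SnF instead tunes this width at runtime).
adj = sp.random(1000, 1000, density=0.01, format="csr", dtype=np.float32)
feats = np.random.rand(1000, 256).astype(np.float32)
assert np.allclose(aggregate_feature_sliced(adj, feats, 32), adj @ feats, atol=1e-4)
```

Because every slice re-traverses the same adjacency structure, the cache behavior observed in the first rounds is representative of later rounds, which is what allows the tile size to be tuned online rather than through offline trial-and-error.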
