GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms

Due to the irregular nature of connections in most graph datasets, partitioning graph analysis algorithms across multiple computational nodes that do not share a common memory inevitably leads to large amounts of interconnect traffic. Previous research has shown that FPGAs can outcompete software-based graph processing in shared memory contexts, but it remains an open question if this advantage can be maintained in distributed systems. In this work, we present GraVF-M, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory. Based on a lightweight description of the algorithm kernel, the framework automatically generates optimized RTL code for the whole multi-FPGA design. We exploit an aspect of the programming model to present a familiar message-passing paradigm to the user, while under the hood implementing a more efficient architecture that can reduce the necessary inter-FPGA network traffic by a factor equal to the average degree of the input graph. A performance model based on a theoretical analysis of the factors influencing performance serves to evaluate the efficiency of our implementation. With a throughput of up to 5.8GTEPS (billions of traversed edges per second) on a 4-FPGA system, the designs generated by GraVF-M compare favorably to state-of-the-art frameworks from the literature and reach 94% of the projected performance limit of the system.

[1]  Nachiket Kapre,et al.  Spatial hardware implementation for sparse graph algorithms in GraphStep , 2011, TAAS.

[2]  Jing Li,et al.  Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform , 2018, FPGA.

[3]  Ieee Staff,et al.  2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig) , 2013 .

[4]  Aart J. C. Bik,et al.  Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[5]  Kiyoung Choi,et al.  ExtraV: Boosting Graph Processing Near Storage with a Coherent Accelerator , 2017, Proc. VLDB Endow..

[6]  Jing Li,et al.  Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search , 2017, FPGA.

[7]  Jure Leskovec,et al.  {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .

[8]  Phillip H. Jones,et al.  Accelerating all-pairs shortest path using a message-passing reconfigurable architecture , 2015, 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig).

[9]  Christos Faloutsos,et al.  R-MAT: A Recursive Model for Graph Mining , 2004, SDM.

[10]  Kunle Olukotun,et al.  GraphOps: A Dataflow Library for Graph Analytics Acceleration , 2016, FPGA.

[11]  Tianshi Chen,et al.  TuNao: A High-Performance and Energy-Efficient Reconfigurable Accelerator for Graph Processing , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[12]  Wenguang Chen,et al.  GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning , 2015, USENIX ATC.

[13]  Yu Wang,et al.  A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration , 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.

[14]  Richard E. Korf,et al.  Multi-Way Number Partitioning , 2009, IJCAI.

[15]  Engelhardt Nina,et al.  Performance-Driven System Generation for Distributed Vertex-Centric Graph Processing on Multi-FPGA Systems , 2018, 2018 28th International Conference on Field Programmable Logic and Applications (FPL).

[16]  Nachiket Kapre Custom FPGA-based soft-processors for sparse graph acceleration , 2015, 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP).

[17]  Hayden Kwok-Hay So,et al.  Towards Flexible Automatic Generation of Graph Processing Gateware , 2017, HEART.

[18]  Magnus Jahre,et al.  Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform , 2015, 2015 25th International Conference on Field Programmable Logic and Applications (FPL).

[19]  Phillip H. Jones,et al.  CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.

[20]  Jing Li,et al.  Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform , 2018, FPGA.

[21]  Oliver Diessel,et al.  International Conference on Field Programmable Technology (FTP 04) , 2004 .

[22]  Derek Chiou,et al.  FPGA-Accelerated Transactional Execution of Graph Workloads , 2017, FPGA.

[23]  Yu Wang,et al.  FPGP: Graph Processing Framework on FPGA A Case Study of Breadth-First Search , 2016, FPGA.

[24]  Wayne Luk,et al.  A framework for FPGA acceleration of large graph problems: Graphlet counting case study , 2011, 2011 International Conference on Field-Programmable Technology.

[25]  Nachiket Kapre,et al.  GraphStep: A System Architecture for Sparse-Graph Algorithms , 2006, 2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[26]  Hayden Kwok-Hay So,et al.  GraVF: A vertex-centric distributed graph processing framework on FPGAs , 2016, 2016 26th International Conference on Field Programmable Logic and Applications (FPL).

[27]  Yu Wang,et al.  ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture , 2017, FPGA.

[28]  Guoqing LEI,et al.  TorusBFS : A Novel Message-passing Parallel Breadth-First Search Architecture on FPGAs , 2015 .

[29]  Viktor K. Prasanna,et al.  An FPGA framework for edge-centric graph processing , 2018, CF.

[30]  James C. Hoe,et al.  GraphGen: An FPGA Framework for Vertex-Centric Graph Computation , 2014, 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines.

[31]  Viktor K. Prasanna,et al.  High-Throughput and Energy-Efficient Graph Processing on FPGA , 2016, 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[32]  Viktor K. Prasanna,et al.  A message-passing multi-softcore architecture on FPGA for Breadth-first Search , 2010, 2010 International Conference on Field-Programmable Technology.

[33]  Vipin Kumar,et al.  A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs , 1998, SIAM J. Sci. Comput..