A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators

Graph applications have gained importance over the last decade, driven by big data analytics problems on Web graphs, social networks, and biological networks. For these applications, traditional CPU and GPU architectures suffer in both performance and power consumption because of irregular communication, random memory accesses, and load-balancing problems. Specialized hardware accelerators have been shown to achieve much better power and energy efficiency than general-purpose CPUs and GPUs. In this paper, we present a template-based methodology targeted specifically at hardware accelerator design for big-data graph applications. The architectural features that are key to energy-efficient execution are implemented in a common template. Using the proposed methodology, hardware accelerators for different graph applications can be designed with little effort. Compared to an application-specific high-level synthesis methodology, we show that the proposed methodology generates hardware accelerators with up to $18\times$ better energy efficiency while requiring less design effort.
