Svelto: High-Level Synthesis of Multi-Threaded Accelerators for Graph Analytics

Graph analytics is an emerging class of irregular applications. Operating on very large datasets, these applications exhibit distinctive behaviors, such as fine-grained, unpredictable memory accesses and highly unbalanced task-level parallelism, that make existing high-performance general-purpose processors and accelerators (e.g., GPUs) suboptimal. To address these issues, researchers and industry are developing a variety of custom accelerator designs for this application area, including solutions based on reconfigurable devices (Field Programmable Gate Arrays, FPGAs). These new approaches often employ High-Level Synthesis (HLS) to speed up accelerator development. In this paper, we propose a novel architecture template for the automatic generation of accelerators for graph analytics and irregular applications. The architecture template includes a dynamic task scheduling mechanism, a parallel array of accelerators that supports task-level parallelism with context switching, and a related multi-channel memory interface that decouples communication from computation and supports fine-grained atomic memory operations. We discuss the integration of the architecture template in an HLS flow, presenting the modifications needed to automatically generate the custom architectures from OpenMP-annotated code. We evaluate our approach first by synthesizing and exploring triangle counting, a common graph algorithm, and then by synthesizing custom designs for a set of graph database benchmark queries, which represent series of graph pattern-matching routines. We compare the synthesized accelerators with previous state-of-the-art methodologies for the synthesis of parallel architectures, showing that the proposed approach reduces resource usage by optimizing the number of accelerator replicas without any performance penalty.
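To make the kind of input concrete, the following is a minimal sketch of triangle counting written as OpenMP-annotated C. It is an illustration only, not the code used in the paper: the `csr_graph` type, the function names, and the choice of `schedule(dynamic)` are assumptions introduced here. It shows the three properties the abstract highlights: data-dependent memory accesses through the adjacency arrays, unbalanced per-vertex work (handled here with dynamic scheduling), and a fine-grained atomic update of a shared counter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CSR graph (an assumption for this sketch): the neighbors of
 * vertex v are col[row[v]] .. col[row[v + 1] - 1], sorted in ascending order.
 * Each undirected edge is stored in both directions. */
typedef struct {
    size_t num_vertices;
    const size_t *row;    /* num_vertices + 1 offsets into col */
    const uint32_t *col;  /* concatenated, sorted adjacency lists */
} csr_graph;

/* Count vertices w > v adjacent to both u and v by merging the two sorted
 * adjacency lists. The accesses to col are data dependent and irregular. */
static uint64_t count_common_above(const csr_graph *g, uint32_t u, uint32_t v)
{
    size_t i = g->row[u], i_end = g->row[u + 1];
    size_t j = g->row[v], j_end = g->row[v + 1];
    uint64_t n = 0;

    while (i < i_end && j < j_end) {
        uint32_t a = g->col[i], b = g->col[j];
        if (a < b) {
            i++;
        } else if (a > b) {
            j++;
        } else {
            if (a > v)
                n++;
            i++;
            j++;
        }
    }
    return n;
}

/* Count each triangle u < v < w exactly once. */
uint64_t count_triangles(const csr_graph *g)
{
    uint64_t total = 0;

    /* One task per vertex; the work per task depends on the vertex degree,
     * so iterations are handed out dynamically. */
    #pragma omp parallel for schedule(dynamic)
    for (long u = 0; u < (long)g->num_vertices; u++) {
        uint64_t local = 0;
        for (size_t e = g->row[u]; e < g->row[u + 1]; e++) {
            uint32_t v = g->col[e];
            if (v > (uint32_t)u)
                local += count_common_above(g, (uint32_t)u, v);
        }
        /* Fine-grained atomic update of the shared counter. */
        #pragma omp atomic
        total += local;
    }
    return total;
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs sequentially with the same result. For example, a complete graph on four vertices contains four triangles, and a three-vertex path contains none.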
