Memory management units that use low-level AXI descriptor chains to hold irregular, graph-oriented access sequences can improve the DRAM throughput of graph algorithms by almost an order of magnitude. For the Xilinx Zed board, we explore and compare the memory throughput achievable when using (1) cache-enabled CPUs with an OS, (2) cache-enabled CPUs running bare-metal code, (3) CPU-based control of FPGA-based AXI DMAs, and (4) local FPGA-based control of AXI DMA transfers. For short-burst irregular traffic generated from sparse graph access patterns, we observe a performance penalty of almost 10× due to DRAM row activations when compared to cache-friendly sequential access. When using an AXI DMA engine configured in FPGA logic and programmed in AXI register mode from the CPU, we can improve DRAM performance by as much as 2.4× over naïve random access on the CPU. In this mode, the host CPU triggers each DMA transfer by writing the appropriate control information to the internal registers of the DMA engine. We also encode the sparse graph access patterns as locally stored, BRAM-hosted AXI descriptor chains that drive the AXI DMA engines in Scatter Gather mode with minimal CPU involvement. In this configuration, we deliver an additional 3× speedup, for a cumulative throughput improvement of 7× over a CPU-based approach that uses caches and runs an OS to manage irregular access.
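To make the idea of encoding a sparse access pattern as a descriptor chain concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation evaluated here: the `Descriptor` fields, the `DESC_STRIDE` spacing, the `build_chain` helper, and the choice to point the last descriptor back at the first are all hypothetical simplifications. Each irregular access becomes one descriptor holding a DRAM buffer address, a byte length, and a pointer to the next descriptor, so that a DMA engine could walk the list without further CPU involvement.

```python
from dataclasses import dataclass
from typing import List

# Assumed spacing between consecutive descriptors in BRAM (hypothetical value).
DESC_STRIDE = 0x40


@dataclass
class Descriptor:
    next_desc: int    # address of the next descriptor in the chain
    buffer_addr: int  # DRAM address of the data this descriptor transfers
    length: int       # transfer length in bytes


def build_chain(data_base: int, indices: List[int],
                elem_bytes: int, desc_base: int) -> List[Descriptor]:
    """Turn a list of irregular element indices into a linked descriptor chain.

    data_base  -- DRAM base address of the array being accessed
    indices    -- element indices in the order they should be fetched
    elem_bytes -- size of one element in bytes
    desc_base  -- address where the first descriptor would be stored
    """
    chain = []
    count = len(indices)
    for i, idx in enumerate(indices):
        # Assumption of this sketch: the last descriptor wraps to the first.
        next_slot = (i + 1) % count
        chain.append(Descriptor(
            next_desc=desc_base + next_slot * DESC_STRIDE,
            buffer_addr=data_base + idx * elem_bytes,
            length=elem_bytes,
        ))
    return chain


# Example: fetch the 4-byte values of three non-adjacent graph neighbours.
chain = build_chain(data_base=0x1000_0000, indices=[7, 2, 9],
                    elem_bytes=4, desc_base=0x4000_0000)
```

In this sketch the CPU would only build the chain once; the per-access cost of register writes in register mode is replaced by the engine following `next_desc` pointers on its own.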